Touch-R1: Reinforcing Touch Reasoning in MLLMs

Fucai Zhu; Siyu Zhu; Weihao Yuan; Yafei Zhou; Yingxin Lai

arxiv: 2605.27154 · v1 · pith:JX5PQGKKnew · submitted 2026-05-26 · 💻 cs.CV

Touch-R1: Reinforcing Touch Reasoning in MLLMs

Yingxin Lai , Yafei Zhou , Fucai Zhu , Siyu Zhu , Weihao Yuan This is my paper

Pith reviewed 2026-06-29 18:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords tactile reasoningmultimodal large language modelsreinforcement learningGRPOtactile datasetphysical groundingvisual-tactile conflict

0 comments

The pith

Touch-R1 trains a 7B multimodal model with a GRPO reward that credits only genuine tactile inputs over removed or shuffled controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that rule-based reinforcement learning can ground tactile reasoning in multimodal models by using an objective that enforces physical evidence over visual priors. It introduces a 1M-pair dataset across four sensors and a benchmark for perception and conflict resolution to support this training. The tactile-use reward component assigns credit only when authentic inputs outperform counterfactual versions, addressing ordinal attributes and sensor shifts. This produces a model that outperforms larger baselines while generating structured traces with probing, comparison, and revision steps.

Core claim

Touch-R1, built on Qwen2.5-VL-7B, is trained via a tactile-grounded GRPO objective combining ordinal-aware accuracy, cross-sensor consistency, format control, and an input-side grounding term; the tactile-use reward gives credit solely when real tactile streams improve correctness relative to removed, shuffled, or noise-masked controls. On TouchReason-Bench the resulting 7B model exceeds Octopi-13B by 18.4 percent and GPT-4o by 24.7 percent on average, with reasoning traces exhibiting emergent probing, comparison, and revision behaviors.

What carries the argument

tactile-use reward inside the GRPO objective that assigns credit only when authentic tactile inputs outperform counterfactual controls (removed, shuffled, or noise-masked)

If this is right

Structured reasoning traces emerge that include probing, comparison, and revision steps.
The 7B model achieves 18.4 percent higher average accuracy than Octopi-13B and 24.7 percent higher than GPT-4o on the benchmark.
Predictions become grounded in physical contact rather than relying on potentially misleading visual priors.
Cross-sensor physical consistency is maintained across four distinct tactile hardware types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same counterfactual-reward pattern could be tested on other sensory streams where removal or corruption of the signal is straightforward.
Robotic systems that must resolve visual-tactile conflicts in real time might benefit from the same structured traces.
The benchmark could serve as a diagnostic for whether vision-language models default to visual heuristics even when touch data is supplied.

Load-bearing premise

The tactile-use reward correctly forces physical grounding instead of simply rewarding patterns that happen to appear in the training data.

What would settle it

A controlled ablation in which the tactile-use reward is removed yet the model still shows the same accuracy gains and emergent behaviors on TouchReason-Bench.

Figures

Figures reproduced from arXiv: 2605.27154 by Fucai Zhu, Siyu Zhu, Weihao Yuan, Yafei Zhou, Yingxin Lai.

**Figure 2.** Figure 2: Overview of TouchReason-1M collected with four optical tactile sensors on 1000+ objects across [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of TouchReason QA Pairs construction. The upper panel shows data collection, human [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Touch-R1: a three-stage framework for tactile-grounded reasoning. (1) Tactile Dynamics Pretraining: a ViT touch encoder is pretrained by predicting future tactile tokens from past ones, capturing deformation dynamics across optical tactile sensors. (2) QA Supervised Fine-Tuning: tactile, visual, and text tokens are aligned to a Qwen2.5-VL-7B backbone on TouchReason-1M, supervising the assistant to answer u… view at source ↗

**Figure 5.** Figure 5: Scaling behavior of Touch-R1 across 3B, 7B, and 14B backbones on TouchReason-Bench. SOI- [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative examples of Touch-R1 reasoning. The model uses tactile evidence to estimate hardness [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

While rule-based reinforcement learning has recently catalyzed explicit reasoning in multimodal models, tactile reasoning remains largely underexplored. Existing tactile-language models primarily rely on supervised or contrastive objectives, which limits their capacity to ground predictions in physical evidence or rectify misleading visual priors. Tactile reasoning introduces two modality-specific challenges: the ordinal nature of physical attributes (e.g., hardness, roughness) and the cross-sensor distribution shifts inherent in optical tactile hardware. In this work, we introduce TouchReason-1M, a large-scale multimodal dataset comprising over 1M synchronized tactile pairs across four distinct sensors, and TouchReason-Bench, a rigorous framework for evaluating tactile perception and visual-tactile conflict resolution. Building upon these, we propose Touch-R1, a tactile reasoning MLLM based on Qwen2.5-VL-7B. Touch-R1 is trained via a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and an input-side tactile grounding objective. Specifically, the tactile-use reward assigns credit only when authentic tactile inputs yield superior correctness relative to counterfactual controls where the tactile stream is removed, shuffled, or noise-masked. On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4\% and GPT-4o by 24.7\% on average. Its structured reasoning traces reveal emergent behaviors of probing, comparison, and revision, demonstrating that R1-style reasoning can be effectively grounded in physical contact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Touch-R1 adds a new multi-sensor tactile dataset and benchmark plus a GRPO objective with counterfactual rewards, but the grounding claim rests on an unproven assumption about what the reward actually enforces.

read the letter

The paper's core offering is TouchReason-1M, a 1M-pair dataset of synchronized tactile readings from four sensors, and TouchReason-Bench for measuring tactile perception plus visual-tactile conflict handling. They fine-tune Qwen2.5-VL-7B with a GRPO objective that adds ordinal-aware accuracy, cross-sensor consistency, format control, and a tactile-use reward. That reward gives credit only when the model scores higher with real tactile input than with the same input removed, shuffled, or noise-masked.

This produces the reported gains: Touch-R1-7B beats Octopi-13B by 18.4% and GPT-4o by 24.7% on average, with traces that show probing, comparison, and revision steps. The dataset scale and the explicit counterfactual term are the clearest additions over prior supervised tactile models.

The approach is reasonable on paper. The reward tries to push the model past visual priors by requiring the tactile stream to improve correctness. The multi-sensor consistency term is a sensible attempt to handle distribution shifts.

The soft spot is the reward itself. The design does not rule out the model learning to recognize the statistical signature of an authentic tactile stream and emitting a different answer distribution without ever using the actual sensor values for properties like hardness or roughness. The added consistency terms may not block that shortcut when four distinct sensor distributions are already visible to the base model. No ablations or controls in the abstract address this directly.

Performance numbers appear without error bars, baseline implementation details, or statistical tests, so the size of the improvement is difficult to judge. The emergent behaviors are noted but not quantified beyond the traces.

This work is aimed at researchers extending RL reasoning methods to tactile or embodied multimodal settings. Readers who care about grounding objectives would want to examine the exact reward equations and any ablations.

The dataset and benchmark are concrete enough to justify peer review. The central grounding claim needs referee scrutiny on the reward design and experimental controls, but the paper is coherent enough on its own terms to go out rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces the TouchReason-1M dataset (>1M synchronized tactile pairs across four sensors) and TouchReason-Bench for tactile perception and visual-tactile conflict resolution. It proposes Touch-R1 (Qwen2.5-VL-7B) trained via tactile-grounded GRPO combining ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and a tactile-use reward that assigns credit only when authentic tactile inputs outperform counterfactual controls (removed, shuffled, noise-masked). On TouchReason-Bench the 7B model outperforms Octopi-13B by 18.4% and GPT-4o by 24.7% on average and exhibits emergent probing/comparison/revision behaviors in reasoning traces.

Significance. If the central claims hold after verification of the reward mechanism, the work would be significant for demonstrating that R1-style RL can ground multimodal reasoning in ordinal physical attributes and mitigate sensor shifts, moving beyond supervised/contrastive tactile models.

major comments (2)

[Abstract] Abstract (tactile-use reward paragraph): the design credits the policy only when correctness with real tactile exceeds the three controls, but provides no analysis showing that outperformance requires extraction of ordinal properties (hardness, roughness) rather than detection of unperturbed stream signatures (sensor noise statistics or pairing patterns in the 1M pairs). This is load-bearing for the physical-grounding claim.
[GRPO objective] GRPO objective description: the cross-sensor consistency and ordinal-aware accuracy terms are stated to be added, but no equations or ablations demonstrate they are sufficient to block a meta-detector that recognizes authentic input distributions without consulting tactile values; the four distinct sensor distributions make this a concrete risk.

minor comments (2)

[Results] Performance numbers (18.4%, 24.7%) are reported without error bars, statistical tests, or baseline implementation details.
[Dataset] Dataset construction details for TouchReason-1M (synchronization across sensors, annotation protocol) are referenced but lack sufficient specificity for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the evidence for physical grounding in the tactile-use reward and GRPO objective.

read point-by-point responses

Referee: [Abstract] Abstract (tactile-use reward paragraph): the design credits the policy only when correctness with real tactile exceeds the three controls, but provides no analysis showing that outperformance requires extraction of ordinal properties (hardness, roughness) rather than detection of unperturbed stream signatures (sensor noise statistics or pairing patterns in the 1M pairs). This is load-bearing for the physical-grounding claim.

Authors: We agree that direct analysis is needed to confirm reliance on ordinal properties rather than sensor signatures. The controls (removed, shuffled, noise-masked) target information content while attempting to retain distribution cues, but we will add ablations in the revision that preserve noise statistics and pairing patterns yet disrupt ordinal values (e.g., label permutation within sensor types). These will quantify the performance drop and support the grounding claim. revision: yes
Referee: [GRPO objective] GRPO objective description: the cross-sensor consistency and ordinal-aware accuracy terms are stated to be added, but no equations or ablations demonstrate they are sufficient to block a meta-detector that recognizes authentic input distributions without consulting tactile values; the four distinct sensor distributions make this a concrete risk.

Authors: We acknowledge the absence of explicit equations and ablations. In the revised manuscript we will include the full mathematical formulation of the GRPO objective with each term and report ablations that isolate the cross-sensor consistency and ordinal-aware accuracy components. These will demonstrate increased vulnerability to distribution-based meta-detection when the terms are removed, addressing the risk across the four sensors. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent components

full rationale

The paper defines a new dataset (TouchReason-1M), benchmark (TouchReason-Bench), and GRPO objective with a tactile-use reward that credits only when real tactile inputs outperform three explicit counterfactual controls. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text that would reduce the claimed performance gains or emergent behaviors to tautological inputs. The reward construction is presented as a design choice rather than derived from prior results by the same authors. The central empirical claims rest on outperformance against external baselines (Octopi-13B, GPT-4o) on the newly introduced benchmark, which supplies an independent evaluation axis. This meets the criteria for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; the central claim rests on unstated implementation details of the GRPO rewards and benchmark construction.

pith-pipeline@v0.9.1-grok · 5817 in / 1234 out tokens · 38554 ms · 2026-06-29T18:51:55.660743+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 26 canonical work pages · 15 internal anchors

[1]

Claude 3.5 Sonnet

Anthropic. Claude 3.5 Sonnet. [Online]. Available: https://www.anthropic.com/news/ claude-3-5-sonnet, 2024. Accessed: 2026-05-06

2024
[2]

S Bai, K Chen, X Liu, J Wang, W Ge, S Song, K Dang, P Wang, S Wang, J Tang, et al. Qwen2. 5-vl technical report (no. arxiv: 2502.13923). arxiv, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

The feeling of success: Does touch sensing help predict grasp outcomes?arXiv preprint arXiv:1710.05512, 2017

Roberto Calandra, Andrew Owens, Manu Upadhyaya, Wenzhen Yuan, Justin Lin, Edward H Adelson, and Sergey Levine. The feeling of success: Does touch sensing help predict grasp outcomes?arXiv preprint arXiv:1710.05512, 2017

work page arXiv 2017
[4]

Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, and Xihui Liu. Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

work page arXiv 2025
[5]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shen- glong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Stola: Self-adaptive touch-language framework for tactile commonsense reasoning in open-ended scenarios

Ning Cheng, Jinan Xu, Jialing Chen, Bin Fang, and Wenjuan Han. Stola: Self-adaptive touch-language framework for tactile commonsense reasoning in open-ended scenarios. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18198–18206, 2026

2026
[7]

Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal representation.Information Fusion, 124:103305, 2025

Ning Cheng, Jinan Xu, Changhao Guan, Jing Gao, Weihao Wang, You Li, Fandong Meng, Jie Zhou, Bin Fang, and Wenjuan Han. Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal representation.Information Fusion, 124:103305, 2025

2025
[8]

DM-Tac X

Daimon (Shenzhen) Robotics Technology Co., Ltd. DM-Tac X. [Online]. Available: https: //www.dmrobot.com/en/product/p1/dm-tac-x.html, 2026. Accessed: 2026-05-01

2026
[9]

Stochastic video generation with a learned prior

Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. InInternational conference on machine learning, pages 1174–1183. PMLR, 2018

2018
[10]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Anytouch: Learning unified static-dynamic representation across multiple visuo-tactile sensors

Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang, and Di Hu. Anytouch: Learning unified static-dynamic representation across multiple visuo-tactile sensors. arXiv preprint arXiv:2502.12191, 2025

work page arXiv 2025
[12]

A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024

Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, and Ken Goldberg. A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024

work page arXiv 2024
[13]

Comparing correspondences: Video pre- diction with correspondence-wise losses

Daniel Geng, Max Hamilton, and Andrew Owens. Comparing correspondences: Video pre- diction with correspondence-wise losses. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3365–3376, 2022

2022
[14]

Gemini 2.5 Pro Preview 05-06

Google DeepMind. Gemini 2.5 Pro Preview 05-06. [Online]. Available: https://ai. google.dev/gemini-api/docs/models#gemini-2.5-pro-preview-05-06 , 2025. Ac- cessed: 2026-05-06

2025
[15]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Sparsh: Self-supervised touch representations for vision-based tactile sensing.arXiv preprint arXiv:2410.24090, 2024

Carolina Higuera, Akash Sharma, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lan- caster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, et al. Sparsh: Self-supervised touch representations for vision-based tactile sensing.arXiv preprint arXiv:2410.24090, 2024. 10

work page arXiv 2024
[17]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Digit: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation.IEEE Robotics and Automation Letters, 5(3):3838–3845, 2020

Mike Lambeta, Po-Wei Chou, Stephen Tian, Brian Yang, Benjamin Maloon, Victoria Rose Most, Dave Stroud, Raymond Santos, Ahmad Byagowi, Gregg Kammerer, et al. Digit: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation.IEEE Robotics and Automation Letters, 5(3):3838–3845, 2020

2020
[20]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Connecting touch and vision via cross-modal prediction

Yunzhu Li, Jun-Yan Zhu, Russ Tedrake, and Antonio Torralba. Connecting touch and vision via cross-modal prediction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10609–10618, 2019

2019
[22]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Visual-rft: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

2034
[24]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Photon - Tactile Sensor

Xense Robotics. Photon - Tactile Sensor. https://www.xenserobotics.com/product/ 367222/detail/15, 2025. Accessed: 2026-05-06

2025
[27]

Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

Jiaer Xia, Yuhang Zang, Peng Gao, Sharon Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

work page arXiv 2025
[28]

Universal visuo-tactile video understanding for embodied interaction, 2025

Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, and Wenbo Ding. Universal visuo-tactile video understanding for embodied interaction, 2025

2025
[29]

Binding touch to everything: Learning unified multimodal tactile representations.arXiv:2401.18084, 2024

Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, and Alex Wong. Binding touch to everything: Learning unified multimodal tactile representations.arXiv:2401.18084, 2024

work page arXiv 2024
[30]

Touch and go: Learning from human-collected vision and touch.arXiv preprint arXiv:2211.12498, 2022

Fengyu Yang, Chenyang Ma, Jiacheng Zhang, Jing Zhu, Wenzhen Yuan, and Andrew Owens. Touch and go: Learning from human-collected vision and touch.arXiv preprint arXiv:2211.12498, 2022

work page arXiv 2022
[31]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Octopi: Object property reasoning with large tactile-language models.arXiv preprint arXiv:2405.02794, 2024

Samson Yu, Kelvin Lin, Anxing Xiao, Jiafei Duan, and Harold Soh. Octopi: Object property reasoning with large tactile-language models.arXiv preprint arXiv:2405.02794, 2024

work page arXiv 2024
[33]

Gelsight: High-resolution robot tactile sensors for estimating geometry and force.Sensors, 17(12):2762, 2017

Wenzhen Yuan, Siyuan Dong, and Edward H Adelson. Gelsight: High-resolution robot tactile sensors for estimating geometry and force.Sensors, 17(12):2762, 2017. 11

2017
[34]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Tac3d: A novel vision-based tactile sensor for measuring forces distribution and estimating friction coefficient distribution.arXiv preprint arXiv:2202.06211, 2022

Lunwei Zhang, Yue Wang, and Yao Jiang. Tac3d: A novel vision-based tactile sensor for measuring forces distribution and estimating friction coefficient distribution.arXiv preprint arXiv:2202.06211, 2022

work page arXiv 2022
[36]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024.URL https://arxiv. org/abs/2410.02713, 17

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Transferable tactile transform- ers for representation learning across diverse sensors and tasks.arXiv preprint arXiv:2406.13640, 2024

Jialiang Zhao, Yuxiang Ma, Lirui Wang, and Edward H Adelson. Transferable tactile transform- ers for representation learning across diverse sensors and tasks.arXiv preprint arXiv:2406.13640, 2024

work page arXiv 2024
[38]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

Junyi Zong, Qingxuan Jia, Meixian Shi, Tong Li, Jiayuan Li, Zihang Lv, Gang Chen, and Fang Deng. Vitatouch: Property-aware vision-tactile-language model for robotic quality inspection in manufacturing.arXiv preprint arXiv:2604.03322, 2026. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026

[1] [1]

Claude 3.5 Sonnet

Anthropic. Claude 3.5 Sonnet. [Online]. Available: https://www.anthropic.com/news/ claude-3-5-sonnet, 2024. Accessed: 2026-05-06

2024

[2] [2]

S Bai, K Chen, X Liu, J Wang, W Ge, S Song, K Dang, P Wang, S Wang, J Tang, et al. Qwen2. 5-vl technical report (no. arxiv: 2502.13923). arxiv, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

The feeling of success: Does touch sensing help predict grasp outcomes?arXiv preprint arXiv:1710.05512, 2017

Roberto Calandra, Andrew Owens, Manu Upadhyaya, Wenzhen Yuan, Justin Lin, Edward H Adelson, and Sergey Levine. The feeling of success: Does touch sensing help predict grasp outcomes?arXiv preprint arXiv:1710.05512, 2017

work page arXiv 2017

[4] [4]

Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, and Xihui Liu. Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

work page arXiv 2025

[5] [5]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shen- glong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Stola: Self-adaptive touch-language framework for tactile commonsense reasoning in open-ended scenarios

Ning Cheng, Jinan Xu, Jialing Chen, Bin Fang, and Wenjuan Han. Stola: Self-adaptive touch-language framework for tactile commonsense reasoning in open-ended scenarios. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18198–18206, 2026

2026

[7] [7]

Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal representation.Information Fusion, 124:103305, 2025

Ning Cheng, Jinan Xu, Changhao Guan, Jing Gao, Weihao Wang, You Li, Fandong Meng, Jie Zhou, Bin Fang, and Wenjuan Han. Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal representation.Information Fusion, 124:103305, 2025

2025

[8] [8]

DM-Tac X

Daimon (Shenzhen) Robotics Technology Co., Ltd. DM-Tac X. [Online]. Available: https: //www.dmrobot.com/en/product/p1/dm-tac-x.html, 2026. Accessed: 2026-05-01

2026

[9] [9]

Stochastic video generation with a learned prior

Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. InInternational conference on machine learning, pages 1174–1183. PMLR, 2018

2018

[10] [10]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Anytouch: Learning unified static-dynamic representation across multiple visuo-tactile sensors

Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang, and Di Hu. Anytouch: Learning unified static-dynamic representation across multiple visuo-tactile sensors. arXiv preprint arXiv:2502.12191, 2025

work page arXiv 2025

[12] [12]

A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024

Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, and Ken Goldberg. A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024

work page arXiv 2024

[13] [13]

Comparing correspondences: Video pre- diction with correspondence-wise losses

Daniel Geng, Max Hamilton, and Andrew Owens. Comparing correspondences: Video pre- diction with correspondence-wise losses. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3365–3376, 2022

2022

[14] [14]

Gemini 2.5 Pro Preview 05-06

Google DeepMind. Gemini 2.5 Pro Preview 05-06. [Online]. Available: https://ai. google.dev/gemini-api/docs/models#gemini-2.5-pro-preview-05-06 , 2025. Ac- cessed: 2026-05-06

2025

[15] [15]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Sparsh: Self-supervised touch representations for vision-based tactile sensing.arXiv preprint arXiv:2410.24090, 2024

Carolina Higuera, Akash Sharma, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lan- caster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, et al. Sparsh: Self-supervised touch representations for vision-based tactile sensing.arXiv preprint arXiv:2410.24090, 2024. 10

work page arXiv 2024

[17] [17]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Digit: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation.IEEE Robotics and Automation Letters, 5(3):3838–3845, 2020

Mike Lambeta, Po-Wei Chou, Stephen Tian, Brian Yang, Benjamin Maloon, Victoria Rose Most, Dave Stroud, Raymond Santos, Ahmad Byagowi, Gregg Kammerer, et al. Digit: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation.IEEE Robotics and Automation Letters, 5(3):3838–3845, 2020

2020

[20] [20]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Connecting touch and vision via cross-modal prediction

Yunzhu Li, Jun-Yan Zhu, Russ Tedrake, and Antonio Torralba. Connecting touch and vision via cross-modal prediction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10609–10618, 2019

2019

[22] [22]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Visual-rft: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

2034

[24] [24]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Photon - Tactile Sensor

Xense Robotics. Photon - Tactile Sensor. https://www.xenserobotics.com/product/ 367222/detail/15, 2025. Accessed: 2026-05-06

2025

[27] [27]

Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

Jiaer Xia, Yuhang Zang, Peng Gao, Sharon Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

work page arXiv 2025

[28] [28]

Universal visuo-tactile video understanding for embodied interaction, 2025

Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, and Wenbo Ding. Universal visuo-tactile video understanding for embodied interaction, 2025

2025

[29] [29]

Binding touch to everything: Learning unified multimodal tactile representations.arXiv:2401.18084, 2024

Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, and Alex Wong. Binding touch to everything: Learning unified multimodal tactile representations.arXiv:2401.18084, 2024

work page arXiv 2024

[30] [30]

Touch and go: Learning from human-collected vision and touch.arXiv preprint arXiv:2211.12498, 2022

Fengyu Yang, Chenyang Ma, Jiacheng Zhang, Jing Zhu, Wenzhen Yuan, and Andrew Owens. Touch and go: Learning from human-collected vision and touch.arXiv preprint arXiv:2211.12498, 2022

work page arXiv 2022

[31] [31]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Octopi: Object property reasoning with large tactile-language models.arXiv preprint arXiv:2405.02794, 2024

Samson Yu, Kelvin Lin, Anxing Xiao, Jiafei Duan, and Harold Soh. Octopi: Object property reasoning with large tactile-language models.arXiv preprint arXiv:2405.02794, 2024

work page arXiv 2024

[33] [33]

Gelsight: High-resolution robot tactile sensors for estimating geometry and force.Sensors, 17(12):2762, 2017

Wenzhen Yuan, Siyuan Dong, and Edward H Adelson. Gelsight: High-resolution robot tactile sensors for estimating geometry and force.Sensors, 17(12):2762, 2017. 11

2017

[34] [34]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Tac3d: A novel vision-based tactile sensor for measuring forces distribution and estimating friction coefficient distribution.arXiv preprint arXiv:2202.06211, 2022

Lunwei Zhang, Yue Wang, and Yao Jiang. Tac3d: A novel vision-based tactile sensor for measuring forces distribution and estimating friction coefficient distribution.arXiv preprint arXiv:2202.06211, 2022

work page arXiv 2022

[36] [36]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024.URL https://arxiv. org/abs/2410.02713, 17

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Transferable tactile transform- ers for representation learning across diverse sensors and tasks.arXiv preprint arXiv:2406.13640, 2024

Jialiang Zhao, Yuxiang Ma, Lirui Wang, and Edward H Adelson. Transferable tactile transform- ers for representation learning across diverse sensors and tasks.arXiv preprint arXiv:2406.13640, 2024

work page arXiv 2024

[38] [38]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

Junyi Zong, Qingxuan Jia, Meixian Shi, Tong Li, Jiayuan Li, Zihang Lv, Gang Chen, and Fang Deng. Vitatouch: Property-aware vision-tactile-language model for robotic quality inspection in manufacturing.arXiv preprint arXiv:2604.03322, 2026. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026