pith. sign in

arxiv: 1907.07958 · v1 · pith:YTAZHZDEnew · submitted 2019-07-18 · 💻 cs.AI · cs.RO

Transfer Learning Across Simulated Robots With Different Sensors

Pith reviewed 2026-05-24 19:56 UTC · model grok-4.3

classification 💻 cs.AI cs.RO
keywords transfer learningreinforcement learningpolicy shapingrobot sensorssimulated robotsproximity sensorscamera imagesBDPI
0
0 comments X

The pith

A policy learned with proximity sensors transfers to guide learning from camera images via Policy Shaping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an agent trained with Bootstrapped Dual Policy Iteration on proximity sensors can supply a useful policy for a second agent that must solve the identical task from raw camera images alone. The state representations differ sharply, yet Policy Shaping lets the target agent learn without separate alignment steps. This setup models the practical gap between lab robots equipped with many sensors and field robots limited to cheaper ones. A sympathetic reader would see it as evidence that transfer across sensor modalities is feasible in continuous-state simulated tasks.

Core claim

We train a BDPI agent embodied in the V-Rep simulator that senses its environment through several proximity sensors and obtains a policy for a given task. That policy is then supplied to a second BDPI agent learning the same task in the same environment but receiving only camera images as input. Policy Shaping incorporates the source policy to steer the target agent's updates, allowing the camera agent to acquire a working policy despite the mismatched state spaces.

What carries the argument

Policy Shaping, the mechanism that injects the source policy's action preferences into the target agent's learning updates to bridge differing sensor-derived state spaces.

If this is right

  • The camera-based agent acquires a functional policy for the simulated task.
  • Transfer succeeds for continuous-state, discrete-action problems in the V-Rep environment.
  • Expensive sensors need only be available during lab training, not during deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shaping approach could be tested on real robots moving from lab sensor suites to field camera-only hardware.
  • Comparing Policy Shaping against other transfer techniques on the identical sensor mismatch would reveal relative robustness.
  • Extending the method to tasks with partial observability or noisy images would test its limits under more realistic conditions.

Load-bearing premise

Policy Shaping alone can bridge the large mismatch between proximity-sensor readings and raw camera images without extra alignment or severe performance loss.

What would settle it

Run the camera-input agent with and without Policy Shaping and check whether task success rate or learning speed remains comparable to the proximity-sensor agent.

read the original abstract

For a robot to learn a good policy, it often requires expensive equipment (such as sophisticated sensors) and a prepared training environment conducive to learning. However, it is seldom possible to perfectly equip robots for economic reasons, nor to guarantee ideal learning conditions, when deployed in real-life environments. A solution would be to prepare the robot in the lab environment, when all necessary material is available to learn a good policy. After training in the lab, the robot should be able to get by without the expensive equipment that used to be available to it, and yet still be guaranteed to perform well on the field. The transition between the lab (source) and the real-world environment (target) is related to transfer learning, where the state-space between the source and target tasks differ. We tackle a simulated task with continuous states and discrete actions presenting this challenge, using Bootstrapped Dual Policy Iteration, a model-free actor-critic reinforcement learning algorithm, and Policy Shaping. Specifically, we train a BDPI agent, embodied by a virtual robot performing a task in the V-Rep simulator, sensing its environment through several proximity sensors. The resulting policy is then used by a second agent learning the same task in the same environment, but with camera images as input. The goal is to obtain a policy able to perform the task relying on merely camera images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that a policy learned by a BDPI agent on a simulated robot using proximity sensors can be transferred via Policy Shaping to a second BDPI agent performing the identical task but receiving only camera images as input, thereby enabling transfer across mismatched state spaces without retraining from scratch.

Significance. A working demonstration of sensor-agnostic policy transfer in continuous-state robotics would be useful for reducing hardware requirements at deployment. The work applies existing methods (BDPI and Policy Shaping) to a V-Rep task but supplies no quantitative results, baselines, or ablation data, so the practical significance cannot yet be assessed.

major comments (2)
  1. [Abstract] Abstract (paragraph beginning 'The resulting policy is then used...'): Policy Shaping requires the source policy to be evaluated on the target agent's current state to bias action selection. The manuscript states that the state spaces differ (proximity vectors vs. raw images) but provides no mapping, shared embedding, auxiliary network, or other mechanism that would allow the proximity-trained policy to be queried on camera states. This renders the transfer step undefined.
  2. [Abstract] Abstract and methods description: No performance numbers, success rates, learning curves, or comparison against a camera-only baseline are supplied. Without these data it is impossible to determine whether transfer occurs, whether it improves over learning from scratch, or whether catastrophic forgetting or performance collapse occurs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph beginning 'The resulting policy is then used...'): Policy Shaping requires the source policy to be evaluated on the target agent's current state to bias action selection. The manuscript states that the state spaces differ (proximity vectors vs. raw images) but provides no mapping, shared embedding, auxiliary network, or other mechanism that would allow the proximity-trained policy to be queried on camera states. This renders the transfer step undefined.

    Authors: We agree that the abstract does not specify how the source policy is evaluated on target states with mismatched inputs. The manuscript applies Policy Shaping for transfer but lacks an explicit description of any state mapping or compatibility mechanism. We will revise the abstract and methods sections to clarify the transfer procedure, including how the source policy biases the target agent despite differing state representations. revision: yes

  2. Referee: [Abstract] Abstract and methods description: No performance numbers, success rates, learning curves, or comparison against a camera-only baseline are supplied. Without these data it is impossible to determine whether transfer occurs, whether it improves over learning from scratch, or whether catastrophic forgetting or performance collapse occurs.

    Authors: We concur that the manuscript provides no quantitative results, which prevents evaluation of the transfer's effectiveness. We will add experimental results to the revised manuscript, including success rates, learning curves, and direct comparisons against a camera-only baseline agent to demonstrate whether transfer improves performance over learning from scratch. revision: yes

Circularity Check

0 steps flagged

Empirical application of existing algorithms; no derivation chain present

full rationale

The paper reports an experimental transfer-learning setup that applies the pre-existing BDPI algorithm and Policy Shaping to two agents with mismatched sensors. No equations, fitted parameters, or first-principles derivations are introduced. The central claim is an empirical observation rather than a mathematical reduction; any citations to the authors' prior work on BDPI describe independent, externally published methods and do not function as load-bearing self-references. The work is therefore self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, fitted constants, or new postulates; ledger is therefore empty.

pith-pipeline@v0.9.0 · 5781 in / 1017 out tokens · 18848 ms · 2026-05-24T19:56:05.555050+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.