Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Oliver Brock; Oussama Zenkri

arxiv: 2605.20072 · v1 · pith:OPCEYVBUnew · submitted 2026-05-19 · 💻 cs.AI · cs.RO


Pith reviewed 2026-05-20 05:12 UTC · model grok-4.3


keywords embodied AI · large language models · robotic agents · perceptual noise · observation fidelity · action repetition · mechanical puzzles


The pith

Embodied LLM agents solve a mechanical puzzle more reliably with raw camera images than with perfect symbolic observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests large language model agents on a physical robot using the Lockbox, a sequential mechanical puzzle with hidden dependencies. Agents receive one of three observation types: raw RGB images from a camera, RGB-D data, or complete ground-truth symbolic state information. Performance is highest with the raw RGB input and lowest with the perfect symbolic data. In simulation, the authors randomly flip the perceived outcomes of actions at controlled rates and find that moderate noise levels raise success rates by interrupting the repetitive action loops that trap agents under clean observations.

Core claim

Embodied LLM agents achieve the best results on the Lockbox task under raw RGB observations and the worst under perfect ground-truth observations. In simulation, randomly flipping perceived action outcomes at a 40 percent probability increases the success rate by a factor of 2.85 over the noise-free baseline. The gain arises because the added noise reduces the frequency of repetitive action sequences that otherwise trap the agent.
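The flip-noise probe described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each action outcome is a single boolean (success or failure) and that flips are independent across steps. The function name `perceive` and the seeded generator are hypothetical.

```python
import random


def perceive(true_outcome: bool, flip_prob: float, rng: random.Random) -> bool:
    """Return the outcome reported to the agent.

    Assumption: the true outcome is inverted independently at every step
    with probability flip_prob.
    """
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must lie in [0, 1]")
    flipped = rng.random() < flip_prob
    return (not true_outcome) if flipped else true_outcome


# Report 10,000 truly successful actions through a 40% flip channel.
rng = random.Random(0)
reports = [perceive(True, 0.4, rng) for _ in range(10_000)]
flip_rate = reports.count(False) / len(reports)
print(f"empirical flip rate: {flip_rate:.3f}")
```

With a flip probability of 0.4 the agent is told the wrong outcome in roughly two of every five steps, which gives a sense of how heavily the peak condition corrupts feedback.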

What carries the argument

The Lockbox, a sequential mechanical puzzle whose parts have hidden interdependencies, serves as the test environment in which changes to observation type and added perceptual noise expose how perception errors interact with the LLM's reasoning.
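To make "hidden interdependencies" concrete, here is a toy stand-in for such a puzzle. It is a sketch under loud assumptions: the real Lockbox's dependency structure is not given on this page, so the example uses a simple chain in which joint i can move only once joint i-1 is open. The class `ToyLockbox` and the default of five joints are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLockbox:
    """Chain-structured toy puzzle (assumed structure, not the real Lockbox)."""

    n_joints: int = 5  # arbitrary illustrative size
    is_open: list[bool] = field(init=False)

    def __post_init__(self) -> None:
        self.is_open = [False] * self.n_joints

    def try_open(self, joint: int) -> bool:
        """Attempt to open a joint; return True only if it opened on this step."""
        if not 0 <= joint < self.n_joints:
            raise IndexError("no such joint")
        # Hidden dependency: a joint is blocked until its predecessor is open.
        unlocked = joint == 0 or self.is_open[joint - 1]
        if unlocked and not self.is_open[joint]:
            self.is_open[joint] = True
            return True
        return False

    @property
    def solved(self) -> bool:
        return all(self.is_open)


box = ToyLockbox(n_joints=3)
print(box.try_open(2))  # blocked by the dependency on joint 1
print(box.try_open(0), box.try_open(1), box.try_open(2), box.solved)
```

The agent never sees the dependency rule directly; it only learns from which attempts succeed, which is why corrupted outcome reports interact so strongly with its reasoning.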

Load-bearing premise

The performance gaps arise mainly from the interaction between perceptual noise and the LLM's reasoning rather than from prompt wording, the specific Lockbox design, or the chosen action space.

What would settle it

Re-running the simulation with random action-outcome flips and finding no performance peak near 40 percent noise or no overall gain compared with the noise-free case would show that moderate noise does not improve results.
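The check described above reduces to a small computation over per-noise-level success counts. The sketch below assumes results arrive as a mapping from flip probability to (successes, trials) and that the noise-free condition is stored under key 0.0. The function name `peak_summary` is hypothetical, and the counts in the usage example are synthetic placeholders, not the paper's data.

```python
def peak_summary(counts: dict[float, tuple[int, int]]) -> tuple[float, float]:
    """Return (flip probability with the highest success rate, fold gain over baseline).

    counts maps flip probability -> (successes, trials) and must contain 0.0.
    """
    rates = {p: s / n for p, (s, n) in counts.items()}
    baseline = rates[0.0]
    if baseline == 0:
        raise ValueError("baseline success rate is zero; fold gain is undefined")
    peak_prob = max(rates, key=rates.get)
    return peak_prob, rates[peak_prob] / baseline


# Synthetic placeholder counts, for illustration only.
synthetic = {0.0: (10, 100), 0.2: (18, 100), 0.4: (25, 100), 0.6: (12, 100)}
peak_prob, fold = peak_summary(synthetic)
print(f"peak at flip probability {peak_prob}, fold gain {fold:.2f}")
```

A replication would contradict the claim if the returned peak sat at 0.0 (fold gain of 1) or far from 0.4. How far counts as "far" is a judgment the page does not specify.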

Figures

Figures reproduced from arXiv: 2605.20072 by Oliver Brock, Oussama Zenkri.

**Figure 2.** Experimental setup with three input modalities for the LLM. Through…

**Figure 3.** The human-inspired strategy outperforms GPT o1 across all input modal…

**Figure 4.** Performance exhibits a non-monotonic dependence on perceptual noise.

**Figure 5.** Perceptual noise is associated with fewer action loops, which are asso…

Original abstract

Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed-loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB-D, and ground-truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively, agents perform best under raw RGB input and worst under perfect ground-truth observations. In simulation, we probe this effect by randomly flipping perceived action outcomes and find that moderate noise improves performance, peaking at a 40% flip probability with a 2.85-fold success rate increase over the noise-free baseline. Further analysis links this gain to a reduction in repetitive action loops. These findings suggest that success rates alone are insufficient for evaluating LLMs, as measured performance may reflect the interaction between perceptual errors and reasoning failures rather than robust problem solving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Moderate noise helps embodied LLMs by breaking action loops, but the simulated noise may not capture the real perceptual errors that arise from RGB input.


Moderate noise in the simulation probe improves performance on this Lockbox task by breaking repetitive action loops, which lines up with why raw RGB beats perfect ground-truth observations in the robot experiments. That's the main thing to take away.

The paper does a good job running the physical comparisons across observation types and then using simulation to test the noise hypothesis directly. Linking the success gain to fewer loops is a step toward behavioral understanding instead of just success rates. The task choice with its hidden dependencies helps make the point about reasoning under uncertainty. It earns credit for the controlled setup and for not stopping at the counterintuitive result but probing why it happens.

The soft spot is the one the stress-test note flags. Uniform random action flips at 40% may not match the actual misperceptions from RGB input, which are likely more structured and not purely random. The paper doesn't show measurements of real error patterns from the RGB condition to validate the model. That leaves open whether the performance peak is general or tied to this particular noise choice. Minor issues include missing trial counts and stats in the abstract, but those are fixable.

Readers working on LLM agents in robotics or on better evaluation practices would get value here. It challenges the higher-fidelity-is-better assumption in a concrete way. This deserves a serious referee. The empirical core is there and worth discussing, even if the causal link needs more work. I'd send it for review.
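The letter's worry about structured versus uniform errors can be made concrete. The sketch below contrasts independent flips with a bursty alternative that has the same average flip rate but clusters errors in time. The two-state Markov model and its `stay` parameter are assumptions chosen for illustration; the page does not report what real RGB misperceptions look like.

```python
import random


def iid_flips(n: int, p: float, rng: random.Random) -> list[bool]:
    """Independent flips: each step is wrong with probability p."""
    return [rng.random() < p for _ in range(n)]


def bursty_flips(n: int, p: float, stay: float, rng: random.Random) -> list[bool]:
    """Two-state Markov flips with long-run flip rate p.

    stay is P(error at t+1 | error at t). The entry probability is chosen so
    that the stationary error rate equals p.
    """
    enter = p * (1.0 - stay) / (1.0 - p)
    state = rng.random() < p
    out = []
    for _ in range(n):
        out.append(state)
        state = rng.random() < (stay if state else enter)
    return out


def repeat_rate(flips: list[bool]) -> float:
    """Fraction of errors that are immediately followed by another error."""
    followers = [b for a, b in zip(flips, flips[1:]) if a]
    return sum(followers) / len(followers)


rng = random.Random(0)
for name, seq in [("iid", iid_flips(50_000, 0.4, rng)),
                  ("bursty", bursty_flips(50_000, 0.4, 0.8, rng))]:
    print(f"{name}: flip rate {sum(seq) / len(seq):.3f}, repeat rate {repeat_rate(seq):.3f}")
```

Both sequences flip about 40% of outcomes, yet in the bursty one an error is far more likely to be followed by another error. A success-rate peak measured under the first model need not carry over to the second.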

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study of embodied LLM agents solving the Lockbox puzzle, a sequential mechanical task with hidden interdependencies. By varying observation fidelity in a physical robotic setup—comparing raw RGB, RGB-D, and ground-truth symbolic states—the authors find that agents achieve highest success with raw RGB inputs and lowest with perfect observations. Simulation experiments introduce random flips to action outcomes, demonstrating that moderate noise (40% probability) yields a 2.85-fold increase in success rate by reducing repetitive action loops. The work concludes that performance metrics must account for interactions between perceptual noise and reasoning processes rather than assuming higher fidelity always improves outcomes.

Significance. If the central findings hold, the work offers a valuable behavioral probe into how observation fidelity interacts with LLM reasoning in closed-loop embodied tasks. The counterintuitive result that raw RGB outperforms perfect ground-truth, linked to noise breaking repetitive loops, provides a concrete mechanism that could inform more robust agent designs. Strengths include the controlled variation of observation types in physical hardware and the use of targeted simulation probes with a specific, experimentally identified noise level (40% flips) that produces a quantifiable 2.85× gain; these elements supply direct, falsifiable evidence rather than post-hoc fitting.
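One way to quantify "repetitive action loops" in an action log is sketched below. The definition is an assumption made for illustration, since the page does not state how the authors measure loops: a loop is counted wherever a block of 1 to `max_period` consecutive actions is immediately repeated. The function name `count_loops` is hypothetical.

```python
def count_loops(actions: list[str], max_period: int = 3) -> int:
    """Count positions where a block of 1..max_period actions repeats back to back.

    Overlapping matches are counted separately, so long runs score higher.
    """
    loops = 0
    for period in range(1, max_period + 1):
        for i in range(len(actions) - 2 * period + 1):
            if actions[i:i + period] == actions[i + period:i + 2 * period]:
                loops += 1
    return loops


print(count_loops(["A", "B", "A", "B", "C"]))  # one period-2 repeat (A, B, A, B)
print(count_loops(["A", "B", "C", "D"]))       # no repeats
```

Comparing this count across noise levels would be one way to check whether higher flip probabilities coincide with fewer loops. A definition that normalizes by episode length may be preferable when episodes differ in duration.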

major comments (2)

[Simulation experiments] Simulation probe section: The claim that moderate perceptual noise explains why raw RGB outperforms ground-truth rests on the assumption that uniform random action-outcome flips reproduce the error statistics of actual RGB observations. The manuscript reports no direct measurements of error rates, error correlations, or state-estimation mistakes observed under the physical RGB condition, so it remains possible that the 2.85-fold peak is an artifact of the particular uniform noise distribution rather than a general interaction between observation fidelity and LLM reasoning.
[Results] Results and experimental details: The reported performance differences (including the 2.85-fold success-rate gain) are presented without explicit trial counts per condition, confidence intervals, or statistical tests for significance. Given the stochastic nature of both LLMs and the physical setup, these details are load-bearing for establishing that the observed ordering (RGB > RGB-D > ground-truth) and the noise-induced improvement are robust rather than sensitive to prompt phrasing or Lockbox-specific mechanics.
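The statistical details requested in the second comment could take the following form. This is a sketch of two standard tools, a Wilson score interval for a success rate and a pooled two-proportion z-test, using only the standard library. The counts in the usage example are synthetic placeholders, since the page reports no trial counts.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def two_proportion_p_value(s1: int, n1: int, s2: int, n2: int) -> float:
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both samples are all successes or all failures
    z = (s1 / n1 - s2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Synthetic placeholder counts: baseline vs. a noisy condition.
low, high = wilson_interval(10, 100)
print(f"baseline 95% interval: [{low:.3f}, {high:.3f}]")
print(f"p-value vs. noisy condition: {two_proportion_p_value(10, 100, 30, 100):.4f}")
```

With small trial counts the normal approximation behind the z-test becomes unreliable, and an exact test would be the safer choice.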

minor comments (2)

[Abstract] The abstract would benefit from a one-sentence summary of trial counts and the statistical test supporting the 2.85-fold claim to allow readers to assess the strength of the behavioral conclusion without immediately consulting the full methods.
[Methods] Notation for observation types (RGB vs. RGB-D vs. symbolic) is clear, but a short table summarizing the exact information content provided to the LLM under each condition would improve readability when comparing the physical and simulation results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, with an emphasis on clarifying our experimental intent and strengthening the presentation of results where possible.


Referee: [Simulation experiments] Simulation probe section: The claim that moderate perceptual noise explains why raw RGB outperforms ground-truth rests on the assumption that uniform random action-outcome flips reproduce the error statistics of actual RGB observations. The manuscript reports no direct measurements of error rates, error correlations, or state-estimation mistakes observed under the physical RGB condition, so it remains possible that the 2.85-fold peak is an artifact of the particular uniform noise distribution rather than a general interaction between observation fidelity and LLM reasoning.

Authors: We thank the referee for this observation. The simulation experiments were designed as a controlled probe to isolate whether the introduction of noise could disrupt repetitive action loops, thereby providing a mechanistic explanation for the counterintuitive ordering observed in the physical trials. We do not claim that uniform random flips exactly reproduce the error statistics, correlations, or state-estimation mistakes present in the physical RGB condition; no such direct measurements were performed. In the revised manuscript we will explicitly qualify the noise model as a simplified probe, state its assumptions, and add a dedicated limitations paragraph noting that future work could characterize actual perceptual error distributions from the RGB setup to test generalizability.
Revision: partial

Referee: [Results] Results and experimental details: The reported performance differences (including the 2.85-fold success-rate gain) are presented without explicit trial counts per condition, confidence intervals, or statistical tests for significance. Given the stochastic nature of both LLMs and the physical setup, these details are load-bearing for establishing that the observed ordering (RGB > RGB-D > ground-truth) and the noise-induced improvement are robust rather than sensitive to prompt phrasing or Lockbox-specific mechanics.

Authors: We agree that greater statistical transparency is warranted given the stochastic elements involved. The revised manuscript will report the exact number of trials conducted for each observation condition and noise level, include confidence intervals on the success rates, and add statistical tests (e.g., proportion tests or ANOVA with post-hoc comparisons) to evaluate the significance of the performance ordering and the 2.85-fold improvement at 40% noise. These additions will help demonstrate that the reported effects are not artifacts of particular prompt realizations or task-specific mechanics.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements and direct simulation probes


The paper reports direct experimental results from physical robotic trials with RGB, RGB-D, and ground-truth observations, followed by simulation probes that inject random action-outcome flips at varying probabilities and measure resulting success rates. The 40% flip probability and 2.85-fold gain are identified by testing multiple discrete noise levels and observing the peak, not by fitting a parameter to reproduce a pre-specified target or by any self-referential equation. No load-bearing steps reduce to definitions, fitted inputs renamed as predictions, or self-citation chains; the derivation chain consists of controlled variation and behavioral measurement.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study rests on standard assumptions about LLM prompting for robotic control and introduces no new physical entities or forces. The noise level is an experimental variable rather than a free parameter fitted to support the main claim.

free parameters (1)

Action-outcome flip probability: the 40% value is the experimentally identified peak rather than a parameter chosen to force agreement with a theoretical prediction.

axioms (1)

Domain assumption: LLMs can be prompted to act as closed-loop controllers that map observations to robot actions.
The entire experimental design presupposes that LLMs can be used this way in the Lockbox task.



