
arxiv: 2605.15938 · v1 · pith:W35LIN2Snew · submitted 2026-05-15 · ⚛️ physics.bio-ph · cs.LG

Clock-state olfactory search in turbulent flows using Q-learning: The geometry of plume recovery

Pith reviewed 2026-05-19 17:30 UTC · model grok-4.3

classification ⚛️ physics.bio-ph cs.LG
keywords olfactory search · Q-learning · turbulent flows · plume recovery · insect navigation · reinforcement learning · intermittency · bio-inspired robotics

The pith

A running clock since the last odor whiff lets a Q-learning agent learn surging, casting and downwind return to recover plumes in turbulence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a tabular Q-learning agent whose only memory is a clock that counts time since the most recent odor detection. This minimal state is enough for the agent to discover a policy that surges forward when an odor is detected, casts crosswind after losing the plume, and returns downwind to search again. The strategy works on data from direct numerical simulations of turbulent flow and reproduces behaviors seen in insects. The same agent cannot adjust its actions to the local level of intermittency in the plume, but adding more state flexibility makes the policy more robust across different turbulence conditions.

Core claim

Using only a running clock since the last whiff as its state representation, tabular Q-learning produces an interpretable recovery policy that combines surging, casting, and a return downwind; this policy performs well on direct numerical simulation data of turbulent odor plumes yet remains limited by its inability to adapt to local intermittency, which additional state flexibility can mitigate.

What carries the argument

A single running clock that tracks time elapsed since the last odor whiff, serving as the complete state input to tabular Q-learning for navigation decisions.
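
The clock-state idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the action set `ACTIONS`, the clock cap `MAX_CLOCK`, and the hyperparameters `alpha` and `gamma` are assumptions chosen for the example, and the paper's actual values may differ.

```python
import numpy as np

# Hypothetical action set and clock cap; not taken from the paper.
ACTIONS = ("upwind", "downwind", "cross_left", "cross_right")
MAX_CLOCK = 50  # clock values saturate here so the table stays finite


def next_clock(clock: int, detected: bool) -> int:
    """Reset the clock on a whiff, otherwise advance it up to the cap."""
    return 0 if detected else min(clock + 1, MAX_CLOCK)


def q_update(q, clock, action, reward, new_clock, alpha=0.1, gamma=0.99):
    """Apply one standard tabular Q-learning update in place."""
    target = reward + gamma * np.max(q[new_clock])
    q[clock, action] += alpha * (target - q[clock, action])


def greedy_action(q, clock: int) -> int:
    """Index of the highest-valued action for the current clock value."""
    return int(np.argmax(q[clock]))


# One row per clock value 0..MAX_CLOCK, one column per action.
q_table = np.zeros((MAX_CLOCK + 1, len(ACTIONS)))
```

Because the table is indexed only by the clock, the learned policy is a single sequence of actions as a function of time since the last whiff, which is what makes it easy to read as surge, cast, and downwind-return phases.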

If this is right

  • The learned policy reproduces the combination of surging, casting, and downwind return observed in insects.
  • The clock-only agent achieves good performance on direct numerical simulations of turbulent flows.
  • Inability to adapt to local intermittency levels limits robustness of the learned strategy.
  • Providing more flexibility in the agent's state representation improves performance across different intermittency conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Simple time-based memory may be sufficient for basic plume recovery even when full history of detections is unavailable.
  • The same clock-state approach could be tested on real insect trajectories to see whether time since last whiff predicts their turns and surges.
  • In environments where intermittency fluctuates rapidly, agents may need explicit intermittency estimation rather than relying on a fixed clock policy.
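
One simple way an agent could estimate intermittency explicitly, as the last bullet suggests, is to track the fraction of recent time steps that contained a detection. The sketch below is an assumption for illustration only: the class name `IntermittencyEstimator` and the window length are hypothetical, and the paper does not describe this estimator.

```python
from collections import deque


class IntermittencyEstimator:
    """Running fraction of recent time steps that contained a detection."""

    def __init__(self, window: int = 100):
        # Hypothetical window length; older samples are dropped automatically.
        self.history = deque(maxlen=window)

    def update(self, detected: bool) -> float:
        """Record one observation and return the current detection fraction."""
        self.history.append(1 if detected else 0)
        return sum(self.history) / len(self.history)
```

A value near 1 indicates a dense signal and a value near 0 a sparse one; a discretized version of this number could serve as a second state variable alongside the clock.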

Load-bearing premise

A single running clock since the last whiff supplies enough state information for tabular Q-learning to converge on a robust recovery policy across varying turbulence intermittency levels.

What would settle it

Run the same Q-learning procedure on turbulence data with controlled intermittency changes and check whether the clock-only agent still recovers the plume at the reported success rate or whether the claimed improvement from added flexibility disappears.

Figures

Figures reproduced from arXiv: 2605.15938 by Agnese Seminara, Marco Rando, Robin A. Heinonen, Yujia Qi.

Figure 1. Problem setting. (a) We perform simulations of a ...
Figure 2. Performance of the agents presented in this paper.
Figure 4. A Q-agent’s generalization to different sparsity levels.
Figure 5. A more flexible algorithm with two separate Q ...
Figure 6. Top: colormap of the average Eulerian blank time (the typical time between successive odor detections) at increasing ...
Figure 7. In the two-Q agent, ...
Figure 8. Geometry of a single quasi-optimal Bayesian recovery ...
Figure 9. Generalization of single Q agents (left) and two-Qs agents (right), measured by the cumulative reward ...
Figure 10. Cast width, upwind search and downwind length, defined in the main text, depend on the duration of the trajectory ...
Figure 11. Results of the 2Q algorithms are robust to the choice of the threshold ...
Figure 12. Policies of the 20 single Q agents, trained in the four environments with increasing sparsity from top to bottom. The ...
Figure 13. Supplementary. Optimizing agents starting with different initial conditions yield qualitatively similar patterns in ...
Original abstract

Finding an odor source in a turbulent flow requires effectively leveraging the history of olfactory observations into a robust navigation strategy. In this work, we use tabular Q-learning to train an olfactory search agent with a minimal memory of past observations: only a running clock since the last whiff. This agent learns an interpretable strategy to recover the plume which combines well-known behaviors observed in insects: surging, casting, and a return downwind. While achieving good performance on data from direct numerical simulations of turbulence, the agent is limited by an inability to adapt its strategy to the local intermittency level; we show that providing more flexibility improves robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript trains a tabular Q-learning agent for odor-source localization in turbulent flows using a minimal state representation consisting solely of the time elapsed since the last odor detection (whiff). The learned policy combines surging, casting, and downwind return maneuvers that match documented insect behaviors. The agent is reported to achieve good performance when tested on direct numerical simulation (DNS) turbulence data, yet the authors explicitly note its inability to modulate behavior according to local intermittency; they demonstrate that increasing state flexibility improves robustness.

Significance. If the quantitative results hold, the work shows that a single scalar clock variable is sufficient for reinforcement learning to discover an interpretable, biologically plausible recovery strategy in realistic turbulence. The explicit acknowledgment of the intermittency-adaptation limitation, together with the improvement obtained by relaxing the state, supplies a concrete, falsifiable next step for minimal-memory olfactory navigation models. Use of DNS data rather than idealized plume models strengthens the ecological relevance of the evaluation.

major comments (2)
  1. [§3 and §4] §3 (Methods) and §4 (Results): the abstract and main text assert 'good performance' and 'performance gains from added flexibility' on DNS data, yet no numerical values for success rate, mean time-to-source, or comparison against baselines are supplied, nor are error bars or statistical tests reported. These omissions make it impossible to judge whether the clock-state policy is genuinely competitive or merely qualitatively plausible.
  2. [§2.2] §2.2 (State definition): the state is restricted to a single running clock since the last whiff. The manuscript itself states that this construction prevents adaptation to local intermittency; because the central robustness claim rests on performance across varying turbulence levels, the absence of any explicit intermittency measure in the state is load-bearing and should be quantified by comparing policies with and without an intermittency statistic.
minor comments (2)
  1. [Figure 3] Figure 3 (or equivalent policy visualization): the trajectories shown are helpful, but the caption should explicitly state the turbulence parameters (Re, Sc, source strength) used for the DNS snapshots.
  2. [§2.2] Notation: the symbol for the clock variable is introduced without a clear definition of its discretization bins or maximum value; this should be stated once in §2.2 and used consistently thereafter.
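
The statistics requested in major comment 1 are straightforward to compute once per-episode outcomes are logged. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the success-rate error bar uses the binomial standard error, and the time-to-source error bar uses the standard error of the mean.

```python
import math
from statistics import mean, stdev


def success_rate_with_se(outcomes):
    """Fraction of successful episodes and its binomial standard error."""
    n = len(outcomes)
    p = sum(outcomes) / n
    return p, math.sqrt(p * (1 - p) / n)


def mean_time_with_se(times):
    """Mean time-to-source and its standard error (needs >= 2 episodes)."""
    return mean(times), stdev(times) / math.sqrt(len(times))
```

Reporting both numbers for the clock-only agent, the more flexible agent, and a baseline at each intermittency level would make the comparison the referee asks for directly readable.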

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and recommendation for minor revision. We value the feedback on strengthening the quantitative aspects and the robustness analysis. We address each major comment below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Methods) and §4 (Results): the abstract and main text assert 'good performance' and 'performance gains from added flexibility' on DNS data, yet no numerical values for success rate, mean time-to-source, or comparison against baselines are supplied, nor are error bars or statistical tests reported. These omissions make it impossible to judge whether the clock-state policy is genuinely competitive or merely qualitatively plausible.

    Authors: We agree that providing quantitative performance metrics is important for rigorously evaluating the agent's effectiveness. The current version of the manuscript focuses on the emergence of interpretable behaviors and the acknowledgment of limitations, but we will revise §3 and §4 to include specific numerical values for success rates, mean time-to-source, comparisons against relevant baselines, error bars, and statistical significance tests based on the DNS turbulence data. This will substantiate the claims of good performance and gains from added flexibility.

    Revision: yes

  2. Referee: [§2.2] §2.2 (State definition): the state is restricted to a single running clock since the last whiff. The manuscript itself states that this construction prevents adaptation to local intermittency; because the central robustness claim rests on performance across varying turbulence levels, the absence of any explicit intermittency measure in the state is load-bearing and should be quantified by comparing policies with and without an intermittency statistic.

    Authors: The manuscript already highlights the limitation of the single-clock state in adapting to local intermittency and shows that increased state flexibility leads to improved robustness. To directly quantify the effect as suggested, we will add in the revised manuscript an explicit comparison between the clock-state policy and policies augmented with an intermittency statistic. This will provide a clearer assessment of how the absence of such a measure impacts performance across different turbulence levels.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the Q-learning derivation from DNS data

Full rationale

The paper trains a tabular Q-learning agent on direct numerical simulation data of turbulent flows using only a running clock since the last whiff as state. The learned policy is reported to combine surging, casting, and downwind return behaviors observed independently in insects, with performance evaluated on the external DNS dataset. No equations or steps in the derivation reduce claimed performance or strategy to a fitted parameter or self-defined input by construction, and no load-bearing self-citation chains or ansatzes are present. The approach remains self-contained against external benchmarks from turbulence simulations and biological observations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard reinforcement-learning convergence assumptions and the fidelity of the direct numerical simulation data to real turbulent odor transport; no new entities are postulated.

axioms (2)
  • [standard math] Tabular Q-learning converges to an optimal policy for the defined finite MDP under sufficient exploration and learning-rate conditions.
    Invoked implicitly by training the agent to a stable policy.
  • [domain assumption] The DNS turbulence fields accurately represent the intermittency and geometry of real odor plumes.
    Performance is evaluated exclusively on these simulated fields.
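
For reference, the first axiom can be written out using the standard statement of tabular Q-learning and its convergence conditions. The symbols below ($s$ for state, $a$ for action, $r$ for reward, $\alpha_t$ for the step size, $\gamma$ for the discount factor) are generic textbook notation and are not taken from the paper.

```latex
% Standard tabular Q-learning update (generic notation, not the paper's)
Q(s,a) \leftarrow Q(s,a)
  + \alpha_t \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

% Sufficient conditions for convergence in a finite MDP:
% every state-action pair is visited infinitely often, and
\sum_{t} \alpha_t = \infty ,
\qquad
\sum_{t} \alpha_t^{2} < \infty .
```

The guarantee is stated for a fully observed finite MDP; the ledger above treats the clock-state problem as one, which is itself part of the assumption.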

pith-pipeline@v0.9.0 · 5641 in / 1363 out tokens · 44063 ms · 2026-05-19T17:30:50.526306+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Smart strategies to navigate turbulent odor plumes reorienting to local wind

    physics.flu-dyn · 2026-05 · unverdicted · novelty 6.0

    Reinforcement learning policies using elapsed time since odor detection and exponentially filtered local wind direction outperform cast-and-surge in simulated turbulent plumes with mild mean wind and show optimal perf...
