pith. machine review for the scientific record.

arxiv: 2605.12569 · v1 · submitted 2026-05-12 · 📡 eess.SP · cs.AI

Recognition: unknown

Active Sensing with Meta-Reinforcement Learning for Emitter Localization from RF Observations

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3

classification 📡 eess.SP cs.AI
keywords GNSS interference localization · active sensing · reinforcement learning · RF observations · multipath propagation · ray tracing simulation · emitter localization

The pith

An RL agent localizes GNSS interference sources by choosing sequential RF sensing actions with a 2x2 patch antenna.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames GNSS interference localization as an active sensing task in which an agent must decide where to steer its antenna for the next radio measurement based on prior observations. Single snapshots are often ambiguous under multipath and changing channels, so the task is modeled as a partially observable decision process solved with recurrent deep reinforcement learning. Both DQN and PPO policies are trained on Sionna ray-tracing data that includes realistic propagation effects and domain shifts. The resulting agents reach an 80.1 percent success rate in identifying the emitter position. A reader would care because this offers an adaptive way to find jamming sources indoors or in cities where fixed scanning methods break down.

Core claim

By modeling emitter localization as a partially observable Markov decision process, the authors train recurrent reinforcement learning policies that select sensing actions from sequences of RF observations collected by a 2x2 patch antenna. In Sionna ray-tracing simulations that include realistic multipath and domain shifts, the resulting agents achieve a localization success rate of 80.1 percent.
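A minimal gym-style sketch of this formulation may help fix ideas. Everything concrete here is a hypothetical placeholder — the five-action set, the scene size, the 64-bin snapshot, the reward constants — not the paper's actual configuration; the paper defines its own action space (Figure 3) and observation pipeline.

    import numpy as np

    class EmitterLocalizationEnv:
        """Active-sensing toy: move through a scene, snapshot RF, stop and report."""

        N_ACTIONS = 5           # four moves plus a stop-and-report action (hypothetical)
        OBS_DIM = 4 * 64        # 4 antenna elements x 64 spectral bins (hypothetical)

        def __init__(self, rng=None):
            self.rng = rng or np.random.default_rng()

        def reset(self):
            self.agent_pos = self.rng.uniform(0.0, 50.0, size=2)
            self.emitter_pos = self.rng.uniform(0.0, 50.0, size=2)  # fixed but unknown
            return self._observe()

        def _observe(self):
            # Stand-in for a ray-traced snapshot; a single one is ambiguous,
            # which is what makes the process partially observable.
            return self.rng.normal(size=self.OBS_DIM).astype(np.float32)

        def step(self, action):
            if action < 4:                          # movement actions
                moves = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 0.0], [-1.0, 0.0]])
                self.agent_pos = self.agent_pos + moves[action]
                done, success = False, False
            else:                                   # stop and report current position
                err = np.linalg.norm(self.agent_pos - self.emitter_pos)
                success = bool(err < 5.0)           # 5 m threshold, per the rebuttal below
                done = True
            reward = 1.0 if success else -0.01      # sparse goal bonus plus step cost
            return self._observe(), reward, done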

What carries the argument

Recurrent policy or value network that maintains an internal state over time to map high-dimensional RF inputs to discrete sensing actions or a localization guess.
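A hedged sketch of what such a network could look like, in PyTorch. The GRU choice, layer sizes, and two-head layout are illustrative assumptions, not the paper's reported architecture.

    import torch
    import torch.nn as nn

    class RecurrentPolicy(nn.Module):
        """GRU-based actor-critic over sequences of RF observations."""

        def __init__(self, obs_dim=256, hidden=128, n_actions=5):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            self.gru = nn.GRU(hidden, hidden, batch_first=True)  # memory across snapshots
            self.policy_head = nn.Linear(hidden, n_actions)      # logits over sensing actions
            self.value_head = nn.Linear(hidden, 1)               # value estimate for the critic

        def forward(self, obs_seq, h0=None):
            # obs_seq: (batch, time, obs_dim) sequence of RF observations
            z = self.encoder(obs_seq)
            out, h = self.gru(z, h0)   # hidden state accumulates evidence over time
            return self.policy_head(out), self.value_head(out), h

The recurrent hidden state is what lets a policy resolve multipath ambiguity: no single snapshot identifies the emitter, but the sequence can.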

If this is right

  • The agent adapts its sensing locations on the fly instead of following a fixed scan pattern.
  • Both value-based and policy-based RL algorithms solve the task, showing the active-sensing formulation works with standard deep RL methods.
  • Simulation training supplies a route to policies that handle varying propagation conditions without collecting real-world data.
  • Partial observability caused by multipath can be overcome by accumulating evidence across multiple observations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same policy could be placed on a mobile robot or drone carrying the antenna array if the simulation gap is small.
  • The active-sensing formulation may extend to locating other RF sources such as wireless devices or radar targets in cluttered spaces.
  • Adding explicit uncertainty estimates to the state could let the agent decide more reliably when to stop and report a position.

Load-bearing premise

The ray-tracing model used for training produces observation statistics that closely match those of real RF hardware in physical environments.

What would settle it

Deploy the trained policy on physical hardware in a multipath-rich indoor testbed with known emitter positions and measure whether the localization success rate remains near 80 percent.
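Short of a full hardware deployment, the load-bearing premise is partially checkable on the bench: compare per-feature observation statistics from the simulator against measurements at matched positions. A minimal sketch, where sim_features and real_features are hypothetical (snapshots, features) arrays; the paper reports no such measurement.

    import numpy as np
    from scipy.stats import ks_2samp

    def observation_gap(sim_features: np.ndarray, real_features: np.ndarray) -> np.ndarray:
        """Two-sample KS statistic per feature column; large values flag a sim-to-real gap."""
        return np.array([
            ks_2samp(sim_features[:, i], real_features[:, i]).statistic
            for i in range(sim_features.shape[1])
        ])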

Figures

Figures reproduced from arXiv: 2605.12569 by Alexander Mattick, Christian Wielenberg, Christopher Mutschler, Felix Ott, Lucas Heublein, M. Shamail J. Khan, Nisha L. Raichur, Tobias Feigl.

Figure 1. End-to-end pipeline showing preprocessing, environment interaction, agent training, and evaluation.
Figure 3. Discrete action space for structured navigation.
Figure 4. Feedforward PPO architecture. The caption's surrounding text defines the total reward at timestep t as r_t = r_prog(t) + r_step + r_succ (Eq. 12), where ε denotes the localization threshold and R_goal is a positive constant; the authors retain this simple design because more elaborate shaping terms led to unstable training and increased return variance (a minimal sketch of this reward follows the figure list).
Figure 7. Explained variance of the value function.
Figure 8. Feature importance for different signal representations.
Figure 9. Comparison of episodic length and return values for …
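The reward structure quoted in the Figure 4 caption is simple enough to sketch. Only the three-term sum and the threshold ε come from the caption; the progress definition and the constant values here are placeholder assumptions.

    def reward(prev_dist, dist, done, eps=5.0, r_step=-0.01, r_goal=1.0):
        """r_t = r_prog(t) + r_step + r_succ, per the Figure 4 caption."""
        r_prog = prev_dist - dist                         # positive when closing on the emitter
        r_succ = r_goal if (done and dist < eps) else 0.0
        return r_prog + r_step + r_succ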
original abstract

Global navigation satellite system (GNSS) interference poses a serious threat to reliable positioning, especially in indoor and multipath-rich environments where source localization is highly challenging. In this paper, we formulate GNSS interference localization as an active sensing problem and propose a reinforcement learning (RL) framework in which an agent sequentially explores the environment to infer the position of an emitter source from radio frequency (RF) observations acquired with a 2x2 patch antenna. The localization task is modeled as a partially observable decision process, since single-snapshot measurements are often ambiguous under multipath propagation and changing channel conditions. To address this, the proposed framework combines high-dimensional RF sensing with deep RL and recurrent policy learning. We investigate both value-based and policy-based approaches, namely Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO), and study their behavior under domain shift. The approach is evaluated on a simulated dataset generated with the Sionna ray-tracing module, which provides realistic propagation effects and diverse environment configurations. Experimental results show that the proposed method achieves a localization success rate of 80.1%, demonstrating the potential of RL for adaptive GNSS interference localization. Overall, the results highlight simulation-assisted training as a promising direction for robust interference localization in challenging propagation environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript formulates GNSS interference localization as a POMDP active-sensing task and proposes an RL framework (DQN and PPO with recurrent policies) that sequentially selects observations from a 2x2 patch antenna to infer emitter position under multipath. All results are obtained from Sionna ray-tracing simulations that include domain-shift experiments; the central numerical claim is an 80.1% localization success rate.

Significance. If the simulation fidelity and policy transfer hold, the work would illustrate a concrete route for RL-driven adaptive RF sensing in challenging propagation environments, with recurrent policies addressing partial observability. The simulation-assisted training pipeline is a clear methodological strength, but the absence of any real RF data or hardware results caps the immediate practical significance.

major comments (3)
  1. [Abstract / Experimental results] The 80.1% success rate is stated without baseline comparisons (random, greedy, or non-RL), ablation studies, error bars, or an explicit definition of the success metric (e.g., a distance threshold), preventing assessment of whether the RL component actually drives the reported performance.
  2. [Simulation and evaluation] The entire performance claim rests on Sionna ray tracing; no real-world RF recordings, hardware testbed results, or quantitative sim-to-real metrics (e.g., domain-adaptation gap) are provided, so the POMDP formulation and learned active-sensing behavior remain unverified for physical deployment.
  3. [Title / Abstract] The title advertises 'Meta-Reinforcement Learning', yet the abstract describes only standard DQN/PPO with recurrent policies and domain-shift experiments; the meta-learning mechanism (if present) is not specified, making it impossible to judge whether the meta component is load-bearing for the 80.1% result.
minor comments (2)
  1. [Methods] Clarify the exact observation space dimensionality and how the 2x2 antenna patterns are incorporated into the Sionna channel model.
  2. [Results] Add a table or figure caption that explicitly lists the success threshold (e.g., <5 m error) used for the 80.1% figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and suggestions. We address each major point below and indicate the revisions planned for the manuscript.

point-by-point responses
  1. Referee: [Abstract / Experimental results] The 80.1% success rate is stated without baseline comparisons (random, greedy, or non-RL), ablation studies, error bars, or an explicit definition of the success metric (e.g., a distance threshold), preventing assessment of whether the RL component actually drives the reported performance.

    Authors: We agree with this observation. The success metric is defined as the percentage of episodes in which the agent's final position estimate is within a 5 m Euclidean distance of the true emitter location. We will add this definition to the abstract and results section. Additionally, we will include baseline comparisons: a random sensing policy, a greedy policy that selects the antenna patch with maximum received power, and non-recurrent versions of DQN and PPO. Ablation studies removing the recurrent memory and varying the number of sensing steps will be presented. All results will include error bars computed over 10 independent training seeds. These additions will strengthen the evaluation of the RL contribution (a minimal sketch of this metric appears after these responses). revision: yes

  2. Referee: [Simulation and evaluation] The entire performance claim rests on Sionna ray tracing; no real-world RF recordings, hardware testbed results, or quantitative sim-to-real metrics (e.g., domain-adaptation gap) are provided, so the POMDP formulation and learned active-sensing behavior remain unverified for physical deployment.

    Authors: The evaluations are performed exclusively in Sionna ray-tracing simulations to enable controlled study of multipath effects and domain shifts. We acknowledge the absence of real-world RF data or hardware experiments as a limitation of the current work. We will add a dedicated paragraph in the discussion section addressing the sim-to-real gap, referencing related literature on RF simulation fidelity, and outlining a roadmap for future hardware validation using software-defined radios. No quantitative sim-to-real metrics can be provided at this time without additional experimental data. revision: partial

  3. Referee: [Title / Abstract] The title advertises 'Meta-Reinforcement Learning', yet the abstract describes only standard DQN/PPO with recurrent policies and domain-shift experiments; the meta-learning mechanism (if present) is not specified, making it impossible to judge whether the meta component is load-bearing for the 80.1% result.

    Authors: The meta-reinforcement learning component is realized through training the recurrent policies on a distribution of environment configurations (varying building layouts, material properties, and emitter positions) to promote generalization across domains, which is evaluated in the domain-shift experiments. This constitutes a meta-learning approach where the policy learns to adapt sensing strategies to new propagation conditions. We agree that the abstract does not sufficiently highlight this aspect. We will revise the abstract to explicitly describe the meta-RL framework, including how domain-shift training enables the meta-adaptation. The 80.1% result is obtained with this meta-trained policy (a sketch of this domain-randomized training loop appears below). revision: yes
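Response 1's promised metric is mechanical to state precisely. A sketch, where final_errors is a hypothetical (seeds × episodes) array of final Euclidean errors in meters:

    import numpy as np

    def success_rate(final_errors: np.ndarray, threshold_m: float = 5.0):
        """Success = final estimate within threshold_m of the true emitter.

        Returns the mean rate and its standard deviation across training seeds,
        matching the 10-seed error bars promised in response 1.
        """
        per_seed = (final_errors < threshold_m).mean(axis=1)  # one rate per seed
        return per_seed.mean(), per_seed.std(ddof=1)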
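Response 3 locates the "meta" in training across a task distribution. A sketch of that sampling loop; every config field and helper here (make_sionna_env, collect, ppo_update) is a hypothetical stand-in, since the paper only states that layouts, materials, and emitter positions vary.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_env_config():
        # Each rollout batch draws a new task: layout, materials, emitter position.
        return {
            "layout_id": int(rng.integers(0, 100)),
            "wall_material": str(rng.choice(["concrete", "glass", "brick"])),
            "emitter_pos": rng.uniform(0.0, 50.0, size=3),  # fixed but unknown within an episode
        }

    # Training loop outline (helpers are hypothetical stand-ins):
    # for update in range(num_updates):
    #     env = make_sionna_env(sample_env_config())   # fresh task per rollout batch
    #     rollouts = collect(policy, env)              # recurrent state resets each episode
    #     ppo_update(policy, rollouts)                 # standard PPO step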

Circularity Check

0 steps flagged

No circularity: empirical simulation results with no self-referential derivations or fitted predictions

full rationale

The manuscript formulates GNSS emitter localization as a POMDP and evaluates DQN/PPO recurrent policies on Sionna ray-tracing simulations, reporting an 80.1% success rate. No equations, parameter-fitting procedures, or derivation chains appear in the provided text that reduce a claimed prediction or result to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is presented as a novel derivation. The central claim rests on direct empirical evaluation within the simulation environment, making the work self-contained against external benchmarks with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the fidelity of the Sionna simulation as a proxy for real RF environments; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Sionna ray tracing provides realistic propagation effects and diverse environment configurations sufficient for training and evaluating the RL agent
    All reported results and the 80.1% success rate rest on this simulation assumption.

pith-pipeline@v0.9.0 · 5555 in / 1173 out tokens · 56352 ms · 2026-05-14T20:40:24.818591+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors

  1. [1]

    Impact and Detection of GNSS Jammers on Consumer Grade Satellite Navigation Receivers,

    D. Borio, F. Dovis, H. Kuusniemi, and L. L. Presti, “Impact and Detection of GNSS Jammers on Consumer Grade Satellite Navigation Receivers,” in Proceedings of the IEEE, May 2016, pp. 1233–1245

  2. [2]

    Variational & Generative Models with Quantization for Disentanglement and Compressed Sensing of GNSS Spectrograms,

    L. Heublein, T. Feigl, A. Rügamer, C. Mutschler, and F. Ott, “Variational & Generative Models with Quantization for Disentanglement and Compressed Sensing of GNSS Spectrograms,” in IEEE J-ISPIN, 2026

  3. [3]

    GNSS Interference Mitigation: A Measurement and Position Domain Assessment,

    D. Borio and C. Gioia, “GNSS Interference Mitigation: A Measurement and Position Domain Assessment,” in NAVIGATION, Jul. 2021

  4. [4]

    An Assessment of Impact of Adaptive Notch Filters for Interference Removal on the Signal Processing Stages of a GNSS Receiver,

    W. Qin, M. T. Gamba, E. Falletti, and F. Dovis, “An Assessment of Impact of Adaptive Notch Filters for Interference Removal on the Signal Processing Stages of a GNSS Receiver,” in IEEE TAES, Apr. 2020

  5. [5]

    Distortionless Space-Time Adaptive Processor Based on MVDR Beamformer for GNSS Receiver,

    X. Dai, J. Nie, F. Chen, and G. Ou, “Distortionless Space-Time Adaptive Processor Based on MVDR Beamformer for GNSS Receiver,” in IET Radar, Sonar & Navigation (RSN), Oct. 2017, pp. 1488–1494

  6. [6]

    On GNSS Jamming Threat from the Maritime Navigation Perspective,

    D. Medina, C. Lass, E. P. Marcos, R. Ziebold, P. Closas, and J. García, “On GNSS Jamming Threat from the Maritime Navigation Perspective,” in Proc. Intl. Conf. on Information Fusion (FUSION), Jul. 2019, pp. 1–7

  7. [7]

    Using Sky-Pointing Fish-Eye Camera and LiDAR to Aid GNSS Single-Point Positioning in Urban Canyons,

    X. Bai, W. Wen, and L.-T. Hsu, “Using Sky-Pointing Fish-Eye Camera and LiDAR to Aid GNSS Single-Point Positioning in Urban Canyons,” in IET Intelligent Transport Systems, May 2020, pp. 908–914

  8. [8]

    Urban Area GNSS In-Car-Jammer Localization Based on Pattern Recognition,

    D. Lyu, X. Chen, F. Wen, L. Pei, and D. He, “Urban Area GNSS In-Car-Jammer Localization Based on Pattern Recognition,” in NAVIGATION, Dec. 2019, pp. 325–340

  9. [9]

    Multiple Emitter Location and Signal Parameter Estimation,

    R. Schmidt, “Multiple Emitter Location and Signal Parameter Estimation,” in IEEE TAP, Mar. 1986, pp. 276–280

  10. [10]

    Jammer Classification in GNSS Bands via Machine Learning Algorithms,

    R. M. Ferre, A. D. L. Fuente, and E. S. Lohan, “Jammer Classification in GNSS Bands via Machine Learning Algorithms,” in Sensors J., Nov. 2019, pp. 4841–4862

  11. [11]

    Attention-Based Fusion of IQ and FFT Spectrograms with AoA Features for GNSS Jammer Localization,

    L. Heublein, C. Wielenberg, T. Nowak, T. Feigl, C. Mutschler, and F. Ott, “Attention-Based Fusion of IQ and FFT Spectrograms with AoA Features for GNSS Jammer Localization,” in RadarConf, Oct. 2025

  12. [12]

    Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks,

    C. Finn, P. Abbeel, and S. Levine, “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks,” in ICML, Jul. 2017, pp. 1126–1135

  13. [13]

    Generalization in Reinforcement Learning by Soft Data Augmentation,

    N. Hansen and X. Wang, “Generalization in Reinforcement Learning by Soft Data Augmentation,” in IEEE ICRA, Oct. 2021, pp. 13611–13617

  14. [14]

    Deep Reinforcement Learning with Robust Augmented Reward Sequence Prediction for Improving GNSS Positioning,

    J. Tang, Z. Li, Q. Yu, H. Zhao, K. Zeng, S. Zhong, Q. Wang, K. Xie, V. Kuzin, and S. Xie, “Deep Reinforcement Learning with Robust Augmented Reward Sequence Prediction for Improving GNSS Positioning,” in GPS Solutions, Feb. 2025, p. 65

  15. [15]

    Deep Reinforcement Learning with Robust Spatial-Temporal Representation for Improving GNSS Positioning Correction,

    Z. Li, P. Li, J. Tang, Y. Song, L. Chen, Y. Cai, and S. Xie, “Deep Reinforcement Learning with Robust Spatial-Temporal Representation for Improving GNSS Positioning Correction,” in GPS Solutions, Jan. 2025, pp. 1–35

  16. [16]

    Anti-Jamming Communication Using Imitation Learning,

    Z. Zhou, Y. Niu, B. Wan, and W. Zhou, “Anti-Jamming Communication Using Imitation Learning,” in Entropy, Nov. 2023, p. 1547

  17. [17]

    On the Performance of Deep Reinforcement Learning-Based Anti-Jamming Method Confronting Intelligent Jammer,

    Y. Li, X. Wang, D. Liu, Q. Guo, X. Liu, J. Zhang, and Y. Xu, “On the Performance of Deep Reinforcement Learning-Based Anti-Jamming Method Confronting Intelligent Jammer,” in Appl. Sci., Mar. 2019

  18. [18]

    Two-Dimensional Anti-Jamming Communication Based on Deep Reinforcement Learning,

    G. Han, X. Liang, and H. V. Poor, “Two-Dimensional Anti-Jamming Communication Based on Deep Reinforcement Learning,” in IEEE ICASSP, Jun. 2017, pp. 2087–2091

  19. [19]

    Frequency Diversity Array Radar and Jammer Intelligent Frequency Domain Power Countermeasures Based on Multi-Agent Reinforcement Learning,

    C. Zhou, C. Wang, L. Bao, X. Gao, J. Gong, and M. Tan, “Frequency Diversity Array Radar and Jammer Intelligent Frequency Domain Power Countermeasures Based on Multi-Agent Reinforcement Learning,” in Remote Sensing, Jun. 2024, p. 2127

  20. [20]

    Jam Me If You Can: Defeating Jammer with Deep Dueling Neural Network Architecture and Ambient Backscattering Augmented Communications,

    V. H. Nguyen, D. N. Nguyen, D. T. Hoang, and E. Dutkiewicz, “Jam Me If You Can: Defeating Jammer with Deep Dueling Neural Network Architecture and Ambient Backscattering Augmented Communications,” in IEEE J-SAC, Aug. 2019, pp. 2603–2620

  21. [21]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv:1707.06347, 2017

  22. [22]

    Temporal Source Recovery for Time-Series Source-Free Unsupervised Domain Adaptation,

    Y. Wang, P. Gong, M. Wu, F. Ott, X. Li, L. Xie, and Z. Chen, “Temporal Source Recovery for Time-Series Source-Free Unsupervised Domain Adaptation,” in IEEE TPAMI, Oct. 2025

  23. [23]

    Markov Decision Processes: Discrete Stochastic Dynamic Programming,

    M. L. Puterman, “Markov Decision Processes: Discrete Stochastic Dynamic Programming,” John Wiley and Sons, Apr. 2014

  24. [24]

    Reinforcement Learning: An Introduction,

    R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” A Bradford Book, May 1998

  25. [25]

    Planning and Acting in Partially Observable Stochastic Domains,

    L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and Acting in Partially Observable Stochastic Domains,” in Artificial Intelligence, May 1998, pp. 99–134

  26. [26]

    Playing Atari with Deep Reinforcement Learning

    V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” arXiv:1312.5602, Dec. 2013, pp. 1–9

  27. [27]

    Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping,

    A. Ng, D. Harada, and S. J. Russell, “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping,” in Intl. Conf. on Machine Learning (ICML), Nov. 1999

  28. [28]

    Sionna: An Open-Source Library for Next-Generation Physical Layer Research,

    J. Hoydis, S. Cammerer, F. A. Aoudia, A. Vem, N. Binder, G. Marcus, and A. Keller, “Sionna: An Open-Source Library for Next-Generation Physical Layer Research,” arXiv:2203.11854, Mar. 2022