pith. machine review for the scientific record.

arxiv: 1312.5602 · v1 · submitted 2013-12-19 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Playing Atari with Deep Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 07:54 UTC · model grok-4.3

classification 💻 cs.LG
keywords deep reinforcement learning · Atari 2600 · convolutional neural networks · Q-learning · value function · control policies · raw pixels · Arcade Learning Environment

The pith

A convolutional neural network learns control policies for Atari games directly from raw pixel inputs using reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a deep learning approach that trains a convolutional neural network with a variant of Q-learning to play Atari games. The network takes raw screen pixels as input and outputs estimates of future rewards for the available actions. The method is applied uniformly to seven different games, with no per-game changes to the architecture or algorithm. It outperforms all prior methods on six games and exceeds human expert performance on three. This matters because it demonstrates that reinforcement learning can scale to complex visual environments without relying on hand-engineered features.

Core claim

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

What carries the argument

A convolutional neural network trained with Q-learning that maps raw pixel inputs to action-value estimates.
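
A minimal sketch of that mapping, assuming the architecture the paper describes (four stacked 84×84 preprocessed frames in, one Q-value per action out) and using PyTorch as an illustrative framework; the class and variable names are assumptions, not the authors' code:

    import torch
    import torch.nn as nn

    class AtariQNetwork(nn.Module):
        """Sketch of the paper's Q-network: raw-pixel input, one Q-value per action."""

        def __init__(self, n_actions: int):
            super().__init__()
            self.net = nn.Sequential(
                # Input: 4 stacked, preprocessed 84x84 grayscale frames.
                nn.Conv2d(4, 16, kernel_size=8, stride=4),   # -> 16 x 20 x 20
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32 x 9 x 9
                nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * 9 * 9, 256),
                nn.ReLU(),
                nn.Linear(256, n_actions),  # one action-value estimate per action
            )

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, 4, 84, 84), pixel intensities scaled to [0, 1].
            return self.net(frames)

    q_net = AtariQNetwork(n_actions=4)
    q_values = q_net(torch.rand(1, 4, 84, 84))  # shape (1, 4)
    greedy_action = q_values.argmax(dim=1)      # acting = picking the argmax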

If this is right

  • Single fixed architecture succeeds across games with varying dynamics and rewards.
  • Outperforms previous methods on six of seven tested Atari games.
  • Surpasses human expert performance on three games.
  • Learns directly from high-dimensional sensory input without domain knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Such models could potentially be adapted to other visual control tasks like robotics.
  • Scaling this approach might enable agents that handle more complex environments.
  • This suggests deep RL can reduce the need for manual feature engineering in game AI.

Load-bearing premise

The assumption that one unchanging convolutional network and Q-learning setup can produce effective policies for games with substantially different reward structures and visual dynamics.
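
For reference, the Q-learning setup this premise leans on is grounded in the Bellman optimality equation, which the paper's iterative updates approximate:

    % Bellman optimality equation for the action-value function (as in the paper):
    % the value of (s, a) is the expected reward plus the discounted value of
    % acting greedily thereafter, with the expectation over the emulator E.
    Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]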

What would settle it

Retraining the described network on the seven Atari games and checking whether it reproduces the reported scores, particularly on the six games where it was claimed to outperform prior methods.
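
A hedged sketch of what that replication check could look like, assuming the gymnasium and ale-py packages as a modern stand-in for the paper's Arcade Learning Environment setup; the random policy is a placeholder where the trained Q-network's greedy action selection would go:

    import ale_py
    import gymnasium as gym
    import numpy as np

    gym.register_envs(ale_py)  # makes the ALE/* environment ids available

    def evaluate(env_id: str, policy, n_episodes: int = 30) -> tuple[float, float]:
        """Return mean and standard deviation of episode scores for a policy."""
        env = gym.make(env_id)
        scores = []
        for _ in range(n_episodes):
            obs, _ = env.reset()
            done, total = False, 0.0
            while not done:
                obs, reward, terminated, truncated, _ = env.step(policy(obs))
                total += reward
                done = terminated or truncated
            scores.append(total)
        env.close()
        return float(np.mean(scores)), float(np.std(scores))

    # Placeholder baseline; a replication would substitute the trained network's
    # argmax over Q-values for the random action choice.
    mean, std = evaluate("ALE/Pong-v5", policy=lambda obs: np.random.randint(6))
    print(f"Pong: {mean:.1f} +/- {std:.1f} over 30 episodes")

Reporting the per-game episode count and score variability, as this harness does, would also address the referee's second minor comment below.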


Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims to present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network trained with a variant of Q-learning whose input is raw pixels and output is a value function; experience replay, with targets computed from the held-fixed parameters of the previous iteration, is used to stabilize training. The same fixed architecture and algorithm (no per-game adjustments) are applied to seven Atari 2600 games from the Arcade Learning Environment, outperforming all previous approaches on six games and surpassing human expert performance on three.

Significance. If the empirical results hold, the work is significant because it shows that deep neural networks can be combined with reinforcement learning to solve control tasks from raw high-dimensional inputs without domain-specific features or tuning. The stabilization techniques (experience replay, and holding the previous iteration's parameters fixed when computing targets) directly address known divergence problems in deep Q-learning, and the consistent results across diverse games with a single method provide evidence of generality. The detailed description of the architecture, update rule, and use of standard benchmarks (Arcade Learning Environment) supports reproducibility of the central empirical claims.

minor comments (3)
  1. [Section 4] The loss function and target computation are described partly in prose; stating the target value y_j as an explicit equation (with the previous-iteration parameters that are held fixed) would improve clarity and make the stabilization mechanism easier to follow (see the equation sketch after this list).
  2. [Table 1, Section 5] Average scores are reported, but the number of evaluation episodes per game and any measure of variability (e.g., standard deviation across runs) are not stated; including these would strengthen assessment of the outperformance claims.
  3. [Section 5] If full learning curves (Figure 2 or equivalent) are present only in supplementary material, a brief reference in the main text would help readers understand the stability achieved by the proposed variant.
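
A sketch of the equation the first comment asks for, transcribed from the paper's Algorithm 1: for a transition (φ_j, a_j, r_j, φ_{j+1}) sampled from the replay memory, the target and per-sample loss are

    % Q-learning target and squared-error loss for a replayed transition:
    y_j =
    \begin{cases}
      r_j & \text{if } \phi_{j+1} \text{ is terminal,} \\
      r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta) & \text{otherwise,}
    \end{cases}
    \qquad
    L_j(\theta) = \bigl( y_j - Q(\phi_j, a_j; \theta) \bigr)^2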

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, accurate summary of the contributions, and recommendation to accept. We are pleased that the significance of combining deep networks with reinforcement learning for high-dimensional control tasks was recognized.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is an empirical demonstration: a fixed CNN architecture plus stabilized Q-learning (experience replay, with targets computed from held-fixed previous-iteration parameters) is trained end-to-end on raw pixels from the external Arcade Learning Environment and evaluated on fresh game episodes. Performance numbers are measured outcomes on public benchmarks, not quantities defined or fitted to themselves. The update rules follow the standard Bellman equation with well-motivated stabilizations; neither the architecture nor the algorithm is derived from the reported scores. No self-citation chain, self-definitional loop, or fitted-input-renamed-as-prediction appears in the derivation or results section. The method is externally falsifiable on the same benchmarks.
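
A minimal sketch of the stabilized update the rationale describes, reusing the AtariQNetwork sketched earlier; the buffer capacity here is an illustrative stand-in, and transitions are assumed to be stored as tensors:

    import random
    from collections import deque

    import torch
    import torch.nn.functional as F

    # Replay memory of (state, action, reward, next_state, done) tensor tuples.
    replay = deque(maxlen=100_000)

    def q_update(q_net, optimizer, gamma=0.99, batch_size=32):
        # Uniform random sampling breaks temporal correlations in the data.
        batch = random.sample(replay, batch_size)
        s, a, r, s2, done = map(torch.stack, zip(*batch))
        with torch.no_grad():
            # Bellman target: reward plus discounted best next action-value;
            # terminal transitions bootstrap nothing.
            target = r + gamma * q_net(s2).max(dim=1).values * (1.0 - done)
        pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Using the same network for prediction and target matches the paper's Algorithm 1; the separate, periodically updated target network arrived in later work.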

Axiom & Free-Parameter Ledger

4 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical training success rather than a closed-form derivation. Standard RL assumptions (Markov property, discounted rewards) and neural network optimization assumptions are used; many hyperparameters are selected by hand or grid search (a configuration sketch, with hedged values, follows the ledger).

free parameters (4)
  • learning rate
    Chosen to ensure stable convergence of the Q-network updates.
  • discount factor gamma
    Set to 0.99; standard value but still a free parameter affecting long-term reward weighting.
  • replay buffer size and sampling
    Hyperparameters controlling experience replay that affect training stability.
  • ε-greedy exploration schedule
    Annealing of ε during training balances exploration against exploitation.
axioms (2)
  • domain assumption The environment satisfies the Markov property with respect to the observed pixel frames.
    Invoked when treating raw pixels as sufficient state for Q-learning.
  • domain assumption Gradient descent on the Q-network loss converges to a useful policy under the chosen hyperparameters.
    Relied upon for the training procedure to succeed across games.
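
The ledger's free parameters, collected as a configuration sketch; replay capacity, minibatch size, and the ε schedule follow values reported in the paper, the discount follows the ledger's entry, and the learning rate is a placeholder assumption:

    # Hedged configuration sketch; values marked as placeholders are assumptions.
    dqn_config = {
        "learning_rate": 2.5e-4,             # placeholder; chosen for stable updates
        "discount_gamma": 0.99,              # long-term reward weighting
        "replay_capacity": 1_000_000,        # most recent frames kept for replay
        "replay_batch_size": 32,             # uniform random minibatches
        "epsilon_start": 1.0,                # epsilon-greedy schedule: annealed
        "epsilon_final": 0.1,                # linearly over the first million
        "epsilon_anneal_frames": 1_000_000,  # frames, then held fixed
    }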

pith-pipeline@v0.9.0 · 5393 in / 1417 out tokens · 57999 ms · 2026-05-11T07:54:39.048358+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  2. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    cs.AI 2023-06 conditional novelty 8.0

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  3. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  4. ASH: Agents that Self-Hone via Embodied Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

  5. Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 7.0

    CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.

  6. TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency

    quant-ph 2026-05 unverdicted novelty 7.0

    TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.

  7. On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

    cs.AI 2026-05 unverdicted novelty 7.0

    Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on t...

  8. Replay-buffer engineering for noise-robust quantum circuit optimization

    quant-ph 2026-04 unverdicted novelty 7.0

    Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compila...

  9. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  10. Bounded Ratio Reinforcement Learning

    cs.LG 2026-04 conditional novelty 7.0

    BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.

  11. Reinforcement Learning via Value Gradient Flow

    cs.LG 2026-04 unverdicted novelty 7.0

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  12. Autonomous Diffractometry Enabled by Visual Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.

  13. SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.

  14. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  15. Dota 2 with Large Scale Deep Reinforcement Learning

    cs.LG 2019-12 accept novelty 7.0

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  16. Soft Actor-Critic Algorithms and Applications

    cs.LG 2018-12 unverdicted novelty 7.0

    SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

  17. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    cs.LG 2018-01 accept novelty 7.0

    Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.

  18. Continuous control with deep reinforcement learning

    cs.LG 2015-09 accept novelty 7.0

    DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competiti...

  19. CA2: Code-Aware Agent for Automated Game Testing

    cs.SE 2026-05 unverdicted novelty 6.0

    CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.

  20. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  21. Discrete Flow Matching for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

  22. DelAC: A Multi-agent Reinforcement Learning of Team-Symmetric Stochastic Games

    cs.MA 2026-05 unverdicted novelty 6.0

    Team-symmetric games always have team-symmetric Nash equilibria solvable via linear complementarity problems, and the DelAC actor-critic MARL algorithm outperforms existing methods in simulations.

  23. Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Plan2Cleanse frames RL backdoor detection as a Monte Carlo planning problem to achieve over 61 percentage point gains in trigger detection and improved win rates in competitive environments.

  24. Learning the Preferences of a Learning Agent

    cs.AI 2026-05 unverdicted novelty 6.0

    Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.

  25. Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.

  26. Quantile Geometry Regularization for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    RQIQN introduces a Wasserstein DRO-based correction to Bellman quantile targets that enlarges distributional spread without altering risk-neutral averages.

  27. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  28. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  29. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  30. Towards Real-time Control of a CartPole System on a Quantum Computer

    quant-ph 2026-05 unverdicted novelty 6.0

    A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, in...

  31. AutoREC: A software platform for developing reinforcement learning agents for equivalent circuit model generation from electrochemical impedance spectroscopy data

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoREC uses a Double Deep Q-Network agent to generate equivalent circuit models from EIS data, reporting over 99.6% success on synthetic sets and generalization to experimental battery, corrosion, and catalysis data.

  32. Improving Zero-Shot Offline RL via Behavioral Task Sampling

    cs.AI 2026-04 unverdicted novelty 6.0

    Extracting task vectors from the offline dataset for policy training improves zero-shot offline RL performance by an average of 20% over random sampling baselines.

  33. Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

  34. From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing

    cs.SE 2026-04 unverdicted novelty 6.0

    PtoP uses SVGD to create diverse, failure-inducing seeds for ADS testing, boosting violation rates by up to 27.68% and diversity by 9.6% over baselines.

  35. Scalable Neighborhood-Based Multi-Agent Actor-Critic

    cs.LG 2026-04 unverdicted novelty 6.0

    MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.

  36. GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    GRAIL autonomously grounds relational concepts in NeSy-RL by using LLM weak supervision followed by interaction-based refinement, matching or exceeding manually defined concepts on Atari games.

  37. Soft-Quantum Algorithms

    quant-ph 2026-04 unverdicted novelty 6.0

    Directly training soft-unitary matrices with a unitarity regularization term and converting them to circuits via alignment enables faster training and lower loss than gate-based optimization on small quantum classific...

  38. Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.

  39. Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions

    cs.LG 2026-04 unverdicted novelty 6.0

    ARL lifts states into signature-augmented manifolds and employs self-consistent proxies of future path-laws to enable deterministic expected-return evaluation while preserving contraction mappings in jump-diffusion en...

  40. Behavior Regularized Offline Reinforcement Learning

    cs.LG 2019-11 unverdicted novelty 6.0

    Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

  41. Towards A Rigorous Science of Interpretable Machine Learning

    stat.ML 2017-02 unverdicted novelty 6.0

    The authors define interpretability for machine learning, specify when it is required, and propose a taxonomy for its rigorous evaluation while identifying open research questions.

  42. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    cs.LG 2016-09 unverdicted novelty 6.0

    Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.

  43. Active Sensing with Meta-Reinforcement Learning for Emitter Localization from RF Observations

    eess.SP 2026-05 unverdicted novelty 5.0

    A meta-reinforcement learning agent achieves 80.1% success in localizing RF emitters by sequentially sensing the environment with a 2x2 patch antenna in Sionna ray-tracing simulations.

  44. Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Higher-resolution observations with global-average-pooling encoders improve RL performance and generalization by enabling more localized visual attention, yielding up to 28% gains over standard Impala encoders.

  45. PG-LRF: Physiology-Guided Latent Rectified Flow for Electro-Hemodynamic PPG-to-ECG Generation

    eess.SP 2026-05 unverdicted novelty 5.0

    PG-LRF generates signal-faithful and physiologically plausible ECGs from PPG inputs by structuring a latent space with an electro-hemodynamic simulator and enforcing consistency in a rectified flow model.

  46. Soft Deterministic Policy Gradient with Gaussian Smoothing

    cs.LG 2026-05 unverdicted novelty 5.0

    Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discr...

  47. E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality i...

  48. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

  49. A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication

    cs.LG 2026-04 unverdicted novelty 5.0

    A survey of MARL with GNN-based communication that proposes a generalized process to organize and clarify existing methods.

  50. Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    Koopman-learned linear dynamics enable an online actor-critic RL method that improves sample efficiency and closed-loop performance on nonlinear robotic systems compared with model-free and other model-based baselines.

  51. Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach

    cs.IT 2026-04 unverdicted novelty 5.0

    A hybrid multi-agent DRL framework with attention and meta-optimization jointly tunes beamforming, power, RIS configuration, and positions to achieve higher energy efficiency in aerial MF-RIS and fluid-antenna full-du...

  52. Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

    cs.AI 2026-04 unverdicted novelty 5.0

    PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.

  53. Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    BRAL-T uses TrustSet-guided reinforcement learning for batch active learning and reports state-of-the-art results on 10 image classification benchmarks plus 2 fine-tuning tasks.

  54. Semantic-Aware UAV Command and Control for Efficient IoT Data Collection

    cs.RO 2026-04 unverdicted novelty 5.0

    A DDQN policy for UAVs using semantic latent representations from DeepJSCC outperforms greedy and traveling salesman baselines in simulated device coverage and image reconstruction quality.

  55. Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    JCQL uses an SLM-trained KBC model as an action in an LLM agent for KBQA to reduce hallucinations, then fine-tunes the KBC model with KBQA reasoning paths, outperforming baselines on two benchmarks.

  56. Hierarchical Reasoning Model

    cs.AI 2025-06 unverdicted novelty 5.0

    HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...

  57. Gymnasium: A Standard Interface for Reinforcement Learning Environments

    cs.LG 2024-07 accept novelty 5.0

    Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.

  58. Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

    cs.CV 2026-05 unverdicted novelty 4.0

    A new CoVQD-guided retrieval-augmented generation framework improves multimodal LLMs on visual question answering by using structured reasoning to retrieve better external knowledge.

  59. Semantic-Aware UAV Command and Control for Efficient IoT Data Collection

    cs.RO 2026-04 unverdicted novelty 4.0

    A semantic-aware UAV framework using DeepJSCC and DDQN outperforms greedy and TSP baselines in device coverage and image reconstruction quality for IoT data collection.

  60. Fuzzy Encoding-Decoding to Improve Spiking Q-Learning Performance in Autonomous Driving

    cs.NE 2026-04 unverdicted novelty 4.0

    A fuzzy encoder-decoder architecture reduces information loss in spiking Q-learning and narrows the performance gap with conventional multi-modal networks on HighwayEnv driving tasks.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 62 Pith papers

  1. [1]

    Residual algorithms: Reinforcement learning with function approximation

    Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pages 30–37. Morgan Kaufmann, 1995

  2. [2]

    Sketch-based linear value function approximation

    Marc Bellemare, Joel Veness, and Michael Bowling. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems 25, pages 2222–2230, 2012

  3. [3]

    The arcade learning environment: An evaluation platform for general agents

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013

  4. [4]

    Investigating contingency awareness using atari 2600 games

    Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using atari 2600 games. In AAAI, 2012

  5. [5]

    Bayesian learning of recursively factored environments

    Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML 2013), pages 1211–1219, 2013

  6. [6]

    Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition

    George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42, January 2012

  7. [7]

    Speech recognition with deep recurrent neural networks

    Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, 2013

  8. [8]

    A neuro-evolution approach to general atari game playing

    Matthew Hausknecht, Risto Miikkulainen, and Peter Stone. A neuro-evolution approach to general atari game playing. 2013

  9. [9]

    Actor-critic reinforcement learning with energy-based policies

    Nicolas Heess, David Silver, and Yee Whye Teh. Actor-critic reinforcement learning with energy-based policies. In European Workshop on Reinforcement Learning, page 43, 2012

  10. [10]

    What is the best multi-stage architecture for object recognition?

    Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 2146–2153. IEEE, 2009

  11. [11]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012

  12. [12]

    Deep auto-encoder neural networks in reinforcement learning

    Sascha Lange and Martin Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–8. IEEE, 2010

  13. [13]

    Reinforcement learning for robots using neural networks

    Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993

  14. [14]

    Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

    Hamid Maei, Csaba Szepesvari, Shalabh Bhatnagar, Doina Precup, David Silver, and Rich Sutton. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. In Advances in Neural Information Processing Systems 22, pages 1204–1212, 2009

  15. [15]

    Toward off-policy learning control with function approximation

    Hamid Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 719–726, 2010

  16. [16]

    Machine Learning for Aerial Image Labeling

    Volodymyr Mnih. Machine Learning for Aerial Image Labeling. PhD thesis, University of Toronto, 2013

  17. [17]

    Prioritized sweeping: Reinforcement learning with less data and less real time

    Andrew Moore and Chris Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103–130, 1993

  18. [18]

    Rectified linear units improve restricted Boltzmann machines

    Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 807–814, 2010

  19. [19]

    Why did TD-Gammon work?

    Jordan B. Pollack and Alan D. Blair. Why did TD-Gammon work? In Advances in Neural Information Processing Systems 9, pages 10–16, 1996

  20. [20]

    Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method

    Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer, 2005

  21. [21]

    Reinforcement learning with factored states and actions

    Brian Sallans and Geoffrey E. Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5:1063–1088, 2004

  22. [22]

    Pedestrian detection with unsupervised multi-stage feature learning

    Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR 2013). IEEE, 2013

  23. [23]

    Reinforcement Learning: An Introduction

    Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998

  24. [24]

    Temporal difference learning and td-gammon

    Gerald Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995

  25. [25]

    An analysis of temporal-difference learning with function approximation

    John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997

  26. [26]

    Q-learning

    Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992