A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.
ACM Sigart Bulletin , volume=
7 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
LeJEPA derives an optimal isotropic Gaussian target for embeddings and enforces it via sketched regularization to deliver scalable, heuristics-free self-supervised pretraining with 79% ImageNet linear accuracy on ViT-H/14.
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
citing papers explorer
-
Imperfect World Models are Exploitable
A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
-
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
LeJEPA derives an optimal isotropic Gaussian target for embeddings and enforces it via sketched regularization to deliver scalable, heuristics-free self-supervised pretraining with 79% ImageNet linear accuracy on ViT-H/14.
-
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
-
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
- Learning to Theorize the World from Observation