Learning Permutations with Sinkhorn Policy Gradient

Patrick Emami; Sanjay Ranka

arxiv: 1805.07010 · v1 · pith:WMFXDLCFnew · submitted 2018-05-18 · 💻 cs.LG · stat.ML

Learning Permutations with Sinkhorn Policy Gradient

Patrick Emami , Sanjay Ranka This is my paper

classification 💻 cs.LG stat.ML

keywords learningsinkhornpermutationpermutationsactor-criticarchitecturedatagradient

0 comments

read the original abstract

Many problems at the intersection of combinatorics and computer science require solving for a permutation that optimally matches, ranks, or sorts some data. These problems usually have a task-specific, often non-differentiable objective function that data-driven algorithms can use as a learning signal. In this paper, we propose the Sinkhorn Policy Gradient (SPG) algorithm for learning policies on permutation matrices. The actor-critic neural network architecture we introduce for SPG uniquely decouples representation learning of the state space from the highly-structured action space of permutations with a temperature-controlled Sinkhorn layer. The Sinkhorn layer produces continuous relaxations of permutation matrices so that the actor-critic architecture can be trained end-to-end. Our empirical results show that agents trained with SPG can perform competitively on sorting, the Euclidean TSP, and matching tasks. We also observe that SPG is significantly more data efficient at the matching task than the baseline methods, which indicates that SPG is conducive to learning representations that are useful for reasoning about permutations.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Unbiased Permutations via Flow Matching
cs.LG 2026-05 unverdicted novelty 7.0

PermFlow applies conditional flow matching on the affine subspace of doubly stochastic matrices with a closed-form tangent projector and nearest-target coupling to capture multimodal permutation distributions.
Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems
cs.RO 2026-05 unverdicted novelty 7.0

OLSF-TRS is a generalized sequential decision framework using structured combinatorial optimization and multi-agent reinforcement learning for order-tote-robot coordination in tote-handling robotic systems, with near-...