pith. sign in

Learning Permutations with Sinkhorn Policy Gradient

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

Many problems at the intersection of combinatorics and computer science require solving for a permutation that optimally matches, ranks, or sorts some data. These problems usually have a task-specific, often non-differentiable objective function that data-driven algorithms can use as a learning signal. In this paper, we propose the Sinkhorn Policy Gradient (SPG) algorithm for learning policies on permutation matrices. The actor-critic neural network architecture we introduce for SPG uniquely decouples representation learning of the state space from the highly-structured action space of permutations with a temperature-controlled Sinkhorn layer. The Sinkhorn layer produces continuous relaxations of permutation matrices so that the actor-critic architecture can be trained end-to-end. Our empirical results show that agents trained with SPG can perform competitively on sorting, the Euclidean TSP, and matching tasks. We also observe that SPG is significantly more data efficient at the matching task than the baseline methods, which indicates that SPG is conducive to learning representations that are useful for reasoning about permutations.

fields

cs.LG 1 cs.RO 1

years

2026 2

verdicts

UNVERDICTED 2

representative citing papers

Learning Unbiased Permutations via Flow Matching

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

PermFlow applies conditional flow matching on the affine subspace of doubly stochastic matrices with a closed-form tangent projector and nearest-target coupling to capture multimodal permutation distributions.

citing papers explorer

Showing 2 of 2 citing papers.

  • Learning Unbiased Permutations via Flow Matching cs.LG · 2026-05-16 · unverdicted · none · ref 7 · internal anchor

    PermFlow applies conditional flow matching on the affine subspace of doubly stochastic matrices with a closed-form tangent projector and nearest-target coupling to capture multimodal permutation distributions.

  • Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems cs.RO · 2026-05-09 · unverdicted · none · ref 97

    OLSF-TRS is a generalized sequential decision framework using structured combinatorial optimization and multi-agent reinforcement learning for order-tote-robot coordination in tote-handling robotic systems, with near-optimal performance on small scales and 8-30%+ improvements over heuristics onlarge