Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Michael Blumenstein; Prasun Roy; Saumik Bhattacharya; Subhankar Ghosh; Umapada Pal

arxiv: 2502.13637 · v2 · submitted 2025-02-19 · 💻 cs.CV · cs.MM

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

Pith reviewed 2026-05-23 02:28 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords human affordance prediction · cross-modal attention · variational autoencoder · pose template · scene context · 2D scene understanding · context-aware generation

The pith

A mutual cross-modal attention mechanism with disentangled VAEs predicts contextually valid human poses in 2D scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to predict novel human poses that represent valid actions within a given 2D scene by learning from surrounding context. It introduces a mutual cross-attention mechanism to encode scene information by having spatial feature maps from two different modalities attend to each other. The method breaks the task into separate steps: a VAE samples a probable person location from global scene context, a classifier selects a pose template from local context around that spot, and two further VAEs sample scale and deformation parameters conditioned on the local context and chosen template. This structured breakdown is intended to manage the vast number of possible pose variations more efficiently than earlier approaches. Experiments indicate the method produces better results than the prior baseline when injecting human affordances into complex scenes.
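The staged decomposition described above can be sketched as a chain of sampling calls. This is a minimal illustration in plain Python, not the authors' implementation: every function name, the `Affordance` container, the template count of 30, and the 32 deformation coordinates are assumptions made for the example, and each stage is a random placeholder standing in for a trained VAE or classifier. It shows only the order of conditioning: location from global context, template from local context, then scale and deformation from local context plus template.

```python
import random
from dataclasses import dataclass


@dataclass
class Affordance:
    location: tuple      # (x, y), normalised to [0, 1] (assumed convention)
    template: int        # index into a hypothetical pose-template set
    scale: float
    deformation: list    # one offset per pose coordinate


def sample_location(global_ctx, rng):
    # Placeholder for the location VAE conditioned on global scene context.
    return (rng.random(), rng.random())


def crop_local_context(scene, location):
    # Placeholder: a real system would encode a window around `location`.
    return {"scene": scene, "center": location}


def classify_template(local_ctx, num_templates, rng):
    # Placeholder for the classifier; a trained model would score templates.
    return rng.randrange(num_templates)


def sample_scale(local_ctx, template, rng):
    # Placeholder for the scale VAE (local context + template class).
    return 0.5 + rng.random()


def sample_deformation(local_ctx, template, num_coords, rng):
    # Placeholder for the deformation VAE (local context + template class).
    return [rng.gauss(0.0, 0.05) for _ in range(num_coords)]


def generate_affordance(scene, num_templates=30, num_coords=32, seed=0):
    rng = random.Random(seed)
    loc = sample_location(scene, rng)                       # stage 1
    ctx = crop_local_context(scene, loc)
    tpl = classify_template(ctx, num_templates, rng)        # stage 2
    scale = sample_scale(ctx, tpl, rng)                     # stage 3
    deform = sample_deformation(ctx, tpl, num_coords, rng)  # stage 4
    return Affordance(loc, tpl, scale, deform)
```

Calling `generate_affordance("scene", seed=1)` twice returns equal results, which makes the stage ordering easy to inspect before any real model is plugged in.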

Core claim

The central claim is that a novel cross-attention mechanism for mutual attention on spatial feature maps from two modalities, integrated with a disentangled pipeline of VAEs for location sampling, template classification, and scale/deformation sampling conditioned on local context, allows for effective human affordance generation in 2D scenes, yielding significant improvements over prior baselines.

What carries the argument

Mutual cross-modal attention mechanism that encodes the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities, together with the disentangled subtask pipeline using VAEs and a classifier.
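As a rough illustration of what "mutually attending" two feature maps can mean, the sketch below applies generic scaled dot-product attention in both directions between two token lists. It is an assumption-laden simplification, not the paper's MCMA block: the function names are hypothetical, there are no learned query/key/value projections, and each spatial feature map is assumed to be already flattened into a list of equal-width vectors.

```python
import math


def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query token attends over the tokens of the other modality."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the other modality's tokens (used as values).
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out


def mutual_cross_attention(feat_a, feat_b):
    # A queries B and B queries A; both attended maps are returned.
    return cross_attend(feat_a, feat_b), cross_attend(feat_b, feat_a)
```

Each output token is a convex combination of the other modality's tokens, so the first result has as many tokens as `feat_a` and the second as many as `feat_b`.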

Load-bearing premise

The exponentially large space of possible poses and actions can be effectively handled by disentangling the problem into location sampling via VAE, template classification, and separate VAEs for scale and deformation conditioned on local context.
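One way to write the premise down is as an approximate factorisation of the joint distribution. The symbols are introduced here for illustration and do not come from the paper: $I$ is the scene, $g(I)$ its global context encoding, $c(I,\ell)$ the local context encoding around location $\ell$, $t$ the template class, $s$ the scale, and $\delta$ the deformation parameters.

```latex
p(\ell, t, s, \delta \mid I) \;\approx\;
  p\big(\ell \mid g(I)\big)\,
  p\big(t \mid c(I,\ell)\big)\,
  p\big(s \mid c(I,\ell),\, t\big)\,
  p\big(\delta \mid c(I,\ell),\, t\big)
```

Under this reading, scale and deformation are treated as conditionally independent given the local context and the template. Whether the deformation stage also conditions on the sampled scale is not stated in the summary, so that independence is an assumption of this sketch.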

What would settle it

A direct comparison on a held-out set of complex 2D scenes where the method shows no improvement or lower accuracy than the baseline in generating valid human affordances would falsify the claim.

Figures

Figures reproduced from arXiv: 2502.13637 by Michael Blumenstein, Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal.

**Figure 1.** Overview of the proposed method. Left: Predicted locations for a new person in the scene. Middle: Estimated scale at each predicted location. Right: Final human pose estimated after scaling and deformation at each predicted location.

**Figure 2.** Architecture of the Mutual Cross-Modal Attention (MCMA) block.

**Figure 3.** An illustration of the proposed architecture. The workflow is divided into four subnetworks to estimate the probable …

**Figure 4.** Qualitative comparison of the proposed method with existing human affordance generation techniques by Wang …

**Figure 5.** Visualization of the learned distribution. (Left) Input scene. (Middle) Distribution of standing poses. (Right) Distribution of sitting poses.

**Figure 6.** Qualitative ablation analysis of the proposed network architecture with different input modalities.

**Figure 8.** Visual examples of downstream rendering of human …

Original abstract

Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations make the problem challenging and non-trivial. However, the existing datasets and methods for human affordance prediction in 2D scenes are significantly limited in the literature. In this paper, we propose a novel cross-attention mechanism to encode the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled among individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template by conditioning on the local context and template class. Our experiments show significant improvements over the previous baseline of human affordance injection into complex 2D scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper puts forward a mutual cross-attention block inside a four-stage VAE-plus-classifier pipeline for 2D human affordance generation, but the abstract supplies no numbers, datasets, or ablations to show the gains are real.

The new element is the mutual cross-attention between two modalities to condition the scene context, combined with the explicit split into global VAE for location, local classifier for pose template, and separate VAEs for scale and deformation. That decomposition is presented as a way to tame the large space of possible poses without having to model everything jointly at once. The approach is straightforward to follow and gives a clear recipe for how the pieces fit together, which is useful for anyone trying to build on prior affordance work in images. The cross-attention itself looks like a reasonable extension of existing attention patterns rather than a complete reinvention.

The main weakness is that the abstract claims significant improvements over the baseline yet gives zero metrics, no dataset names, no baseline details, and no error analysis. Without those, it is impossible to tell whether the staged sampling actually captures the necessary correlations or whether the reported gains are tied to particular test scenes. The stress-test point about missing joint dependencies between location, template, scale, and deformation is worth checking in the full experiments; if the conditioning and attention do not propagate information across stages, the outputs could still be invalid even if the individual modules train cleanly.

This is a narrow computer-vision paper aimed at people already working on scene understanding or interactive agents. A reader who needs a concrete pipeline for 2D affordance generation could extract the method description and try to reproduce it. The work is coherent on its own terms and shows honest engagement with the task structure, so it clears the bar for a serious referee even though the current evidence is thin. I would send it out for review rather than desk-reject, mainly to get the numbers and ablations on the table.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a mutual cross-modal attention mechanism for human affordance generation in 2D scenes. It disentangles the task by using a VAE conditioned on global scene context to sample person location, a classifier on local context to select a pose template from candidates, and two additional VAEs to sample scale and deformation parameters conditioned on local context and template class. The abstract asserts that experiments demonstrate significant improvements over a prior baseline for affordance injection into complex scenes.

Significance. If the factorization and cross-attention approach can be shown to produce valid, diverse affordances with measurable gains, the work would offer a structured way to manage the combinatorial complexity of pose prediction for applications in scene understanding and navigation agents. The explicit separation of global location sampling from local template and parameter generation is a clear design choice that could be reusable if empirically supported.

major comments (2)

[Abstract] Abstract: the statement that 'our experiments show significant improvements over the previous baseline' supplies no quantitative metrics, dataset names, baseline descriptions, ablation results, or error analysis. This absence directly prevents verification of the central empirical claim.
[Method] Method description (disentanglement steps): the claim that separate VAEs and a classifier conditioned only on local context plus template class suffice to cover the joint distribution over location, template, scale, and deformation is load-bearing for the efficiency argument, yet no ablation on cross-component correlations or diversity metrics is referenced to test whether the factorization produces valid affordances or merely artifacts of the test scenes.

minor comments (1)

[Abstract] Abstract: the phrase 'exponentially large number of probable pose and action variations' is repeated without a supporting citation or complexity argument; a single reference to prior work on pose space size would clarify the motivation.
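The diversity check requested in the second major comment could take many forms. The sketch below shows one generic option, mean pairwise keypoint distance over a set of sampled poses. It is a hypothetical metric chosen for illustration, not one the paper is known to report, and it assumes each pose is a list of `(x, y)` keypoints in a shared coordinate frame with the same keypoint ordering.

```python
import math
from itertools import combinations


def pose_distance(p, q):
    # Mean Euclidean distance between corresponding keypoints.
    return sum(math.dist(a, b) for a, b in zip(p, q)) / len(p)


def mean_pairwise_diversity(poses):
    # Average distance over all unordered pairs of sampled poses.
    pairs = list(combinations(poses, 2))
    if not pairs:
        return 0.0  # fewer than two samples: no diversity to measure
    return sum(pose_distance(p, q) for p, q in pairs) / len(pairs)
```

A value near zero across many samples at one location would suggest the staged samplers collapse to a single pose, which is the kind of failure the referee comment is probing for.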

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate planned revisions to the next version.

Referee: [Abstract] Abstract: the statement that 'our experiments show significant improvements over the previous baseline' supplies no quantitative metrics, dataset names, baseline descriptions, ablation results, or error analysis. This absence directly prevents verification of the central empirical claim.

Authors: We agree that the abstract is too concise and omits key details needed to substantiate the central claim. In the revised manuscript we will expand the abstract to report specific quantitative metrics (e.g., improvement percentages on the primary evaluation metric), name the dataset and baseline, and point to the ablation and error-analysis sections already present in the body of the paper.

Revision: yes

Referee: [Method] Method description (disentanglement steps): the claim that separate VAEs and a classifier conditioned only on local context plus template class suffice to cover the joint distribution over location, template, scale, and deformation is load-bearing for the efficiency argument, yet no ablation on cross-component correlations or diversity metrics is referenced to test whether the factorization produces valid affordances or merely artifacts of the test scenes.

Authors: The factorization is introduced precisely to manage the combinatorial explosion of the joint distribution; each component is conditioned on the appropriate context (global for location, local plus template class for the remaining parameters) and the mutual cross-attention supplies the necessary scene encoding. While the current version does not contain dedicated ablations on inter-component statistical dependence, the reported diversity and validity metrics already indicate that the generated affordances are both plausible and varied. To directly address the concern we will add a short discussion of the independence assumptions together with an ablation that measures the effect of removing cross-component conditioning.

Revision: yes

Circularity Check

0 steps flagged

No circularity: method is a proposed architecture without self-referential derivations

The paper describes a disentangled pipeline (global VAE for location sampling, local classifier for template, separate VAEs for scale/deformation, plus mutual cross-attention) and reports experimental improvements over a baseline. No equations, uniqueness theorems, or self-citations are invoked in the provided text that would reduce any claimed prediction or result to a fitted input or prior author work by construction. The architecture choices are presented as design decisions rather than derived necessities, and the central claim rests on empirical comparison rather than any load-bearing self-referential step. This matches the default expectation for non-circular papers in the absence of mathematical derivations.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions of variational autoencoders and attention mechanisms from prior literature; no new entities are postulated. Free parameters such as latent dimensions and conditioning details are implicit but unspecified in the abstract.

free parameters (2)

VAE latent space dimensions: standard VAE hyperparameter required for sampling location, scale, and deformation, but not quantified in the abstract.
Number of pose template candidates: the classifier operates over an existing set whose size affects complexity reduction but is not stated.
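To make the role of the latent dimension concrete, the sketch below shows the two standard ingredients of a diagonal-Gaussian VAE: reparameterised sampling and the closed-form KL term against a standard normal prior. These are the textbook forms, assumed here for illustration; the paper's exact losses and latent sizes are not given on this page, and the function names are hypothetical. The latent dimension is simply the length of `mu` and `log_var`.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(0.5 * log_var).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

With `mu` and `log_var` both zero the KL term vanishes, since the posterior then coincides with the prior; a larger latent dimension adds more terms to the sum and more capacity to the sampler.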

axioms (2)

Domain assumption: scene context can be encoded into global and local feature maps that are mutually informative for affordance. Invoked in the description of the cross-attention mechanism and conditioning steps.
Domain assumption: disentangling location, template, scale, and deformation sufficiently reduces the combinatorial complexity of human poses. Stated as the rationale for the multi-stage pipeline in the abstract.

pith-pipeline@v0.9.0 · 5760 in / 1212 out tokens · 22410 ms · 2026-05-23T02:28:48.900487+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1] J. J. Gibson, The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.
[2] Y. Zhu, A. Fathi, and F.-F. Li, “Reasoning about object affordances in a knowledge base representation,” in European Conference on Computer Vision (ECCV), 2014.
[3] C.-Y. Chuang, J. Li, A. Torralba, and S. Fidler, “Learning to act properly: Predicting and explaining affordances from images,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[4] X. Wang, R. Girdhar, and A. Gupta, “Binge watching: Scaling affordance learning from sitcoms,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] J. Wang, S. Yan, B. Dai, and D. Lin, “Scene-aware generative network for human motion synthesis,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[6] L. Zhang, W. Du, S. Zhou, J. Wang, and J. Shi, “Inpaint2Learn: A self-supervised framework for affordance learning,” in The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022.
[7] J. Yao, J. Chen, L. Niu, and B. Sheng, “Scene-aware human pose generation using transformer,” in ACM International Conference on Multimedia (MM), 2023.
[8] S. J. Anderson, N. Yamagishi, and V. Karavia, “Attentional processes link perception and action,” Proceedings of the Royal Society of London. Series B: Biological Sciences, 2002.
[9] A. Roy and S. Todorovic, “A multi-scale cnn for affordance segmentation in rgb images,” in European Conference on Computer Vision (ECCV).
[10] T.-T. Do, A. Nguyen, and I. Reid, “AffordanceNet: An end-to-end deep learning approach for object affordance detection,” in The IEEE International Conference on Robotics and Automation (ICRA), 2018.
[11] E. Barsoum, J. Kender, and Z. Liu, “HP-GAN: Probabilistic 3d human motion prediction via gan,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
[12] H. Cai, C. Bai, Y.-W. Tai, and C.-K. Tang, “Deep video generation, prediction and completion of human action sequences,” in European Conference on Computer Vision (ECCV), 2018.
[13] C. Yang, Z. Wang, X. Zhu, C. Huang, J. Shi, and D. Lin, “Pose guided human video generation,” in European Conference on Computer Vision (ECCV), 2018.
[14] S. Yan, Z. Li, Y. Xiong, H. Yan, and D. Lin, “Convolutional sequence generation for skeleton-based action synthesis,” in The IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[15] C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, “Action2Motion: Conditioned generation of 3d human motions,” in ACM International Conference on Multimedia (MM), 2020.
[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[17] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in The International Conference on Learning Representations (ICLR), 2015.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS), 2017.
[19] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[20] J. Jain, J. Li, M. T. Chiu, A. Hassani, N. Orlov, and H. Shi, “OneFormer: One transformer to rule universal image segmentation,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[21] A. Hassani and H. Shi, “Dilated neighborhood attention transformer,” arXiv preprint arXiv:2209.15001, 2022.
[22] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision (IJCV), 2019.
[23] B. Zhang and R. Sennrich, “Root mean square layer normalization,” Advances in Neural Information Processing Systems (NeurIPS), 2019.
[24] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations (ICLR), 2014.
[25] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, 1951.
[26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
[27] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[28] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[29] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “ViTPose: Simple vision transformer baselines for human pose estimation,” Advances in Neural Information Processing Systems (NeurIPS), 2022.
[30] Z. Geng, C. Wang, Y. Wei, Z. Liu, H. Li, and H. Hu, “Human pose as compositional tokens,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[31] P. Roy, S. Ghosh, S. Bhattacharya, U. Pal, and M. Blumenstein, “Scene aware person image generation through global contextual conditioning,” in International Conference on Pattern Recognition (ICPR), 2022.
[32] H.-S. Park and C.-H. Jun, “A simple and fast algorithm for k-medoids clustering,” Expert Systems with Applications, 2009.
[33] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[34] S. Zhu, Z. Lin, S. Cohen, J. Kuen, Z. Zhang, and C. Chen, “TopNet: Transformer-based object placement network for image compositing,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[35] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky, “Resolution-robust large mask inpainting with fourier convolutions,” in The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022.
[36] Y. Yang and D. Ramanan, “Articulated human detection with flexible mixtures of parts,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[37] B. Artacho and A. Savakis, “UniPose: Unified human pose estimation in single images and videos,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[38] K. Li, S. Wang, X. Zhang, Y. Xu, W. Xu, and Z. Tu, “Pose recognition with cascade transformers,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[39] L. Zhang, T. Wen, J. Min, J. Wang, D. Han, and J. Shi, “Learning object placement by inpainting for compositional data augmentation,” in European Conference on Computer Vision (ECCV), 2020.
[40] S. Zhou, L. Liu, L. Niu, and L. Zhang, “Learning object placement via dual-path graph completion,” in European Conference on Computer Vision (ECCV), 2022.
[41] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[42] A. K. Bhunia, S. Khan, H. Cholakkal, R. M. Anwer, J. Laaksonen, M. Shah, and F. S. Khan, “Person image synthesis via denoising diffusion model,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[1] [1]

J. J. Gibson, The Ecological Approach to Visual Perception. Houghton Mifflin, 1979. 1, 2

work page 1979

[2] [2]

Reasoning about object affordances in a knowledge base representation,

Y . Zhu, A. Fathi, and F.-F. Li, “Reasoning about object affordances in a knowledge base representation,” in European Conference on Computer Vision (ECCV), 2014. 1, 2

work page 2014

[3] [3]

Learning to act properly: Predicting and explaining affordances from images,

C.-Y . Chuang, J. Li, A. Torralba, and S. Fidler, “Learning to act properly: Predicting and explaining affordances from images,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1, 2 EXPLORING MUTUAL CROSS-MODAL ATTENTION FOR CONTEXT-AW ARE HUMAN AFFORDANCE GENERATION 11

work page 2018

[4] [4]

Binge watching: Scaling affordance learning from sitcoms,

X. Wang, R. Girdhar, and A. Gupta, “Binge watching: Scaling affordance learning from sitcoms,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2017. 1, 2, 6, 7, 8

work page 2017

[5] [5]

Scene-aware generative network for human motion synthesis,

J. Wang, S. Yan, B. Dai, and D. Lin, “Scene-aware generative network for human motion synthesis,” in The IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR) , 2021. 1, 2

work page 2021

[6] [6]

Inpaint2Learn: A self-supervised framework for affordance learning,

L. Zhang, W. Du, S. Zhou, J. Wang, and J. Shi, “Inpaint2Learn: A self-supervised framework for affordance learning,” in The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , 2022. 1, 2, 6, 7, 8

work page 2022

[7] [7]

Scene-aware human pose generation using transformer,

J. Yao, J. Chen, L. Niu, and B. Sheng, “Scene-aware human pose generation using transformer,” in ACM International Conference on Multimedia (MM), 2023. 1, 2, 6, 7, 8

work page 2023

[8] [8]

Attentional processes link perception and action,

S. J. Anderson, N. Yamagishi, and V . Karavia, “Attentional processes link perception and action,” Proceedings of the Royal Society of London. Series B: Biological Sciences , 2002. 2

work page 2002

[9] [9]

A multi-scale cnn for affordance segmentation in rgb images,

A. Roy and S. Todorovic, “A multi-scale cnn for affordance segmentation in rgb images,” in European Conference on Computer Vision (ECCV) ,

work page

[10] [10]

AffordanceNet: An end-to-end deep learning approach for object affordance detection,

T.-T. Do, A. Nguyen, and I. Reid, “AffordanceNet: An end-to-end deep learning approach for object affordance detection,” in The IEEE International Conference on Robotics and Automation (ICRA) , 2018. 2

work page 2018

[11] [11]

HP-GAN: Probabilistic 3d human motion prediction via gan,

E. Barsoum, J. Kender, and Z. Liu, “HP-GAN: Probabilistic 3d human motion prediction via gan,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) , 2018. 2

work page 2018

[12] [12]

Deep video generation, prediction and completion of human action sequences,

H. Cai, C. Bai, Y .-W. Tai, and C.-K. Tang, “Deep video generation, prediction and completion of human action sequences,” in European Conference on Computer Vision (ECCV) , 2018. 2

work page 2018

[13] [13]

Pose guided human video generation,

C. Yang, Z. Wang, X. Zhu, C. Huang, J. Shi, and D. Lin, “Pose guided human video generation,” in European Conference on Computer Vision (ECCV), 2018. 2

work page 2018

[14] [14]

Convolutional sequence generation for skeleton-based action synthesis,

S. Yan, Z. Li, Y . Xiong, H. Yan, and D. Lin, “Convolutional sequence generation for skeleton-based action synthesis,” in The IEEE/CVF In- ternational Conference on Computer Vision (ICCV) , 2019. 2

work page 2019

[15] [15]

Action2Motion: Conditioned generation of 3d human motions,

C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, “Action2Motion: Conditioned generation of 3d human motions,” in ACM International Conference on Multimedia (MM), 2020. 2

work page 2020

[16] [16]

ImageNet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in The IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) , 2009. 2, 3

work page 2009

[17] [17]

Very deep convolutional networks for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in The International Conference on Learning Representations (ICLR) , 2015. 2, 3

work page 2015

[18] [18]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS) , 2017. 3

work page 2017

[19] [19]

Non-local neural net- works,

X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural net- works,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 3

work page 2018

[20] J. Jain, J. Li, M. T. Chiu, A. Hassani, N. Orlov, and H. Shi, “OneFormer: One transformer to rule universal image segmentation,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[21] A. Hassani and H. Shi, “Dilated neighborhood attention transformer,” arXiv preprint arXiv:2209.15001, 2022.

[22] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision (IJCV), 2019.

[23] B. Zhang and R. Sennrich, “Root mean square layer normalization,” Advances in Neural Information Processing Systems (NeurIPS), 2019.

[24] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations (ICLR), 2014.

[25] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, 1951.

[26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.

[27] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[28] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[29] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “ViTPose: Simple vision transformer baselines for human pose estimation,” Advances in Neural Information Processing Systems (NeurIPS), 2022.

[30] Z. Geng, C. Wang, Y. Wei, Z. Liu, H. Li, and H. Hu, “Human pose as compositional tokens,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[31] P. Roy, S. Ghosh, S. Bhattacharya, U. Pal, and M. Blumenstein, “Scene aware person image generation through global contextual conditioning,” in International Conference on Pattern Recognition (ICPR), 2022.

[32] H.-S. Park and C.-H. Jun, “A simple and fast algorithm for k-medoids clustering,” Expert Systems with Applications, 2009.

[33] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[34] S. Zhu, Z. Lin, S. Cohen, J. Kuen, Z. Zhang, and C. Chen, “TopNet: Transformer-based object placement network for image compositing,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[35] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky, “Resolution-robust large mask inpainting with fourier convolutions,” in The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022.

[36] Y. Yang and D. Ramanan, “Articulated human detection with flexible mixtures of parts,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.

[37] B. Artacho and A. Savakis, “UniPose: Unified human pose estimation in single images and videos,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[38] K. Li, S. Wang, X. Zhang, Y. Xu, W. Xu, and Z. Tu, “Pose recognition with cascade transformers,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[39] L. Zhang, T. Wen, J. Min, J. Wang, D. Han, and J. Shi, “Learning object placement by inpainting for compositional data augmentation,” in European Conference on Computer Vision (ECCV), 2020.

[40] S. Zhou, L. Liu, L. Niu, and L. Zhang, “Learning object placement via dual-path graph completion,” in European Conference on Computer Vision (ECCV), 2022.

[41] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[42] A. K. Bhunia, S. Khan, H. Cholakkal, R. M. Anwer, J. Laaksonen, M. Shah, and F. S. Khan, “Person image synthesis via denoising diffusion model,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.