DenoiseRank: Learning to Rank by Diffusion Models

Preslav Nakov; Shangsong Liang; Ying Wang

arxiv: 2604.20852 · v1 · submitted 2026-02-17 · 💻 cs.IR · cs.AI

Pith reviewed 2026-05-15 22:12 UTC · model grok-4.3

keywords learning to rank · diffusion models · generative models · information retrieval · ranking distribution · denoising process

The pith

A diffusion model learns to rank by reversing noise added to relevance labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper recasts learning to rank (LTR) as a generative diffusion task rather than a discriminative one. It proposes DenoiseRank, which adds noise to relevance labels in a forward process and learns to denoise them in the reverse process, conditioned on query and document features, to recover the ranking distribution. This matters because most existing LTR models output point estimates, whereas a generative model could capture the full probability distribution over possible rankings. If the approach holds, it opens a class of ranking methods that start from noise and refine it step by step into relevance predictions.

Core claim

DenoiseRank addresses traditional learning to rank from a generative perspective using diffusion models. In the forward diffusion process, noise is added to the relevance labels. In the reverse process, the model denoises these labels, conditioned on the query and its documents, to predict the label distribution over the documents. The authors report experiments on benchmark datasets that show the model is effective, and they position it as a benchmark for generative LTR.
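For orientation, the two processes described above can be written in the standard DDPM form. This is only a sketch, assuming DenoiseRank uses the usual Gaussian formulation. The symbols $Y_0$ (clean relevance labels for one query), $Y_t$ (noised labels at step $t$), $D$ (query-document features) and $p_\theta$ (the denoising network) follow the loss quoted from the paper in the Lean-theorem section of this page. The schedule symbols $\beta_s$, $\alpha_s$ and $\bar\alpha_t$ are the conventional DDPM ones and are an assumption here, not something the abstract confirms.

```latex
% Forward (noising) process in the standard DDPM closed form (assumed):
q(Y_t \mid Y_0) = \mathcal{N}\!\left(Y_t;\ \sqrt{\bar\alpha_t}\, Y_0,\ (1 - \bar\alpha_t)\, I\right),
\qquad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s, \quad \alpha_s = 1 - \beta_s .

% Reverse (denoising) prediction, conditioned on the query-document features D:
\hat{Y}_0 = p_\theta(D, Y_t, t).

% Training objective as quoted from the paper:
\mathcal{L} = \mathbb{E}_{t,\, Y_0,\, p_\theta}\left[\, \lVert Y_0 - p_\theta(D, Y_t, t) \rVert^2 \,\right].
```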

What carries the argument

The diffusion-based denoising process that recovers relevance distributions from noisy labels conditioned on queries and documents.

If this is right

The model predicts a full distribution over rankings for each query rather than single scores.
It serves as a benchmark for future generative approaches to LTR.
Effectiveness is demonstrated on standard benchmark datasets.
It enables LTR without relying solely on discriminative classifiers or regressors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Diffusion models for ranking might naturally support sampling varied rankings to promote result diversity.
The technique could be adapted to preference learning in recommender systems by diffusing user feedback labels.
Integrating diffusion steps with existing LTR features might improve handling of sparse or noisy training data.

Load-bearing premise

That the relevance distribution over documents for a query can be accurately recovered by reversing a diffusion process applied to noisy relevant labels.

What would settle it

A test where the model is trained and evaluated on data with relevance labels generated from a non-Markovian process that diffusion models cannot represent, and it underperforms standard LTR baselines.

Figures

Figures reproduced from arXiv: 2604.20852 by Preslav Nakov, Shangsong Liang, Ying Wang.

**Figure 1.** The left panel illustrates the diffusion and reverse processes in DenoiseRank, while the right panel shows …

**Figure 2.** NDCG@K of DenoiseRank with different learning rates on Microsoft Web30K, Yahoo! and Istella.

**Figure 3.** Training loss curves of DenoiseRank (left) and Rankformer (right) on the MS Web30K and Yahoo! datasets.

**Figure 4.** Control group experimental results for effective feature counts, NDCG@K performance produced by …

**Figure 5.** Statistical and experimental results of the query-document length analysis.

**Figure 4.** NDCG@K of DenoiseRank at different noise …

**Figure 4.** NDCG@K of DenoiseRank at different noise schedules on Microsoft Web30K, Yahoo! and Istella.

**Figure 5.** NDCG@K of DenoiseRank at different denoise-net designs on Microsoft Web30K, Yahoo! and Istella (captioned Figure 8 in the source).

**Figure 9.** NDCG@K of DenoiseRank without and with self-attention on Microsoft Web30K, Yahoo! and Istella datasets. (1) RSD considers M inference runs and M sequences of items, while traditional diversity metrics focus on a single ranking sequence of items. (2) RSD focuses on the diversity of the sequences, while traditional diversity metrics consider the similarity between items in a single sequence. For instance, DI…

**Figure 10.** A t-SNE plot of the diverse top-20 ranking sequences predicted at inference for a single randomly selected query. The blue points denote the ranking sequences inferred by DenoiseRank using 100 different $Y_T$ values drawn from Gaussian noise. The orange-yellow points denote the sequences predicted by Rankformer in 100 attempts. Testing was conducted on the MS Web30K dataset.
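Most of the captions above report NDCG@K. As a reading aid, here is a minimal sketch of that metric, assuming the common exponential gain $2^{y} - 1$ and a $\log_2$ position discount. The function names are illustrative, and the paper's own evaluation code may differ in details such as tie handling.

```python
import numpy as np

def dcg_at_k(labels_in_ranked_order, k):
    """Discounted cumulative gain over the first k ranked positions."""
    rel = np.asarray(labels_in_ranked_order, dtype=float)[:k]
    gains = 2.0 ** rel - 1.0                          # exponential gain
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(1 + position)
    return float(np.sum(gains / discounts))

def ndcg_at_k(labels, scores, k):
    """NDCG@K for one query: DCG of the predicted order over the ideal DCG."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")        # highest score first
    ideal_dcg = dcg_at_k(np.sort(labels)[::-1], k)
    if ideal_dcg == 0.0:                              # no relevant documents
        return 0.0
    return dcg_at_k(labels[order], k) / ideal_dcg

if __name__ == "__main__":
    print(ndcg_at_k([3, 2, 0], [0.1, 0.5, 0.9], k=3))
```

A ranking that orders documents exactly by label scores 1.0, and any other ordering scores lower.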

Original abstract

Learning to rank (LTR) is one of the core tasks in Machine Learning. Traditional LTR models have made great progress, but nearly all of them are implemented from discriminative perspective. In this paper, we aim at addressing LTR from a novel perspective, i.e., by a deep generative model. Specifically, we propose a novel denoise rank model, DenoiseRank, which noises the relevant labels in the diffusion process and denoises them on the query documents in the reverse process to accurately predict their distribution. Our model is the first to address traditional LTR from generative perspective and is a diffusion method for LTR. Our extensive experiments on benchmark datasets demonstrated the effectiveness of DenoiseRank, and we believe it provides a benchmark for generative LTR task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DenoiseRank tries diffusion on relevance labels for LTR but skips the math showing it recovers actual ranking distributions.

The main thing to know is that this paper reframes learning to rank as a generative diffusion task: noise is added to relevance labels in a forward process, then a model learns to denoise them conditioned on query-document features to recover the ranking distribution. They position it as the first diffusion-based generative method for standard LTR, which is a genuine shift from the usual discriminative setups. Experiments on benchmark datasets are reported to show competitive or better results than baselines, so there is at least some empirical backing for the approach. Extending diffusion ideas into ranking is not an obvious move and deserves credit for the attempt.

The soft spots are in the core technical justification. Relevance labels are discrete, yet diffusion models are built for continuous spaces, and the paper does not supply a derivation or error bound for how the reverse process maps back to valid rankings or respects properties like transitivity. The abstract and available details leave the approximation unproven, so the claim that it accurately predicts the true conditional distribution rests heavily on the experiments rather than on shown convergence. This is the kind of gap a referee would need to see addressed.

The work is aimed at information retrieval researchers already comfortable with both LTR and generative models. A reader looking for new angles on ranking might pick up the concept and the benchmark results, but it is unlikely to be cited widely until the mapping from discrete labels to diffusion is made rigorous. It deserves peer review so the authors can fill in the missing steps on the generative claim.

Referee Report

3 major / 2 minor

Summary. The paper introduces DenoiseRank, a diffusion-based generative model for learning to rank (LTR). It applies a forward noising process to relevance labels and a learned reverse denoising process conditioned on query-document features to recover the conditional ranking distribution, claiming to be the first generative diffusion approach to traditional LTR and reporting effectiveness on benchmark datasets.

Significance. If the central claim holds with rigorous justification, the work could establish a new generative paradigm for LTR that models ranking distributions rather than point estimates, potentially improving robustness to label noise and uncertainty. The experiments on benchmarks would then provide a useful reference point for future generative LTR methods.

major comments (3)

[Abstract and §3 (Proposed Method)] The claim that noising discrete relevance labels followed by denoising on query-document features 'accurately predict[s] their distribution' lacks any derivation showing that the reverse process recovers the true P(y|q,D) or respects ranking invariants such as transitivity. Standard diffusion is defined on continuous spaces; the paper must specify the embedding/relaxation of ordinal labels and bound the approximation error.
[§4 (Experiments)] No details are provided on the diffusion schedule, the exact form of the denoising network, how discrete labels are mapped into the continuous diffusion process, or any error analysis (e.g., KL divergence to ground-truth ranking distributions). Without these, it is impossible to verify whether the reported effectiveness stems from the generative formulation or from standard LTR components.
[§3.2 (Reverse Process)] The training objective is not shown to be equivalent to maximizing the likelihood of the true ranking distribution; if the denoiser is trained only on a simplified DDPM-style loss, the recovered samples may not correspond to valid permutations or scores for arbitrary queries.

minor comments (2)

[Abstract] The abstract states the model is 'the first' without citing prior generative LTR work (e.g., variational or flow-based ranking models); a brief related-work paragraph should be added.
[§3.1] Notation for relevance labels (typically integers 0–4) and their diffusion embedding should be introduced consistently in §3.1.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment point by point below, indicating the revisions we will incorporate to improve rigor and clarity.

Referee: [Abstract and §3 (Proposed Method)] The claim that noising discrete relevance labels followed by denoising on query-document features 'accurately predict[s] their distribution' lacks any derivation showing that the reverse process recovers the true P(y|q,D) or respects ranking invariants such as transitivity. Standard diffusion is defined on continuous spaces; the paper must specify the embedding/relaxation of ordinal labels and bound the approximation error.

Authors: We agree that the current manuscript presents the approach at a high level without a full derivation. In the revised version we will add a dedicated subsection in §3 deriving the reverse process under a continuous relaxation where ordinal labels are linearly mapped to [0,1]. We will show that the learned denoising approximates the conditional P(y|q,D) via the standard diffusion ELBO, with approximation error bounded by the forward-process variance schedule. Regarding ranking invariants, we will clarify that the model outputs a distribution over scores; the final ranking is obtained by sorting the expected scores, which preserves transitivity by construction.

Revision: yes
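The point about sorting expected scores can be illustrated with a small sketch. It assumes that M denoising runs each produce one real-valued score per document; the function names and array layout are hypothetical and are not taken from the paper. Sorting a single vector of real numbers always yields a total order, which is why transitivity holds for the final ranking even when individual samples disagree.

```python
import numpy as np

def rank_from_samples(score_samples):
    """Rank documents by their mean score across M sampled denoising runs.

    score_samples: array-like of shape (M, n_docs), one row per run (assumed layout).
    Returns document indices, best first.
    """
    samples = np.asarray(score_samples, dtype=float)
    expected = samples.mean(axis=0)                # Monte Carlo estimate of E[score]
    return np.argsort(-expected, kind="stable")

def per_sample_rankings(score_samples):
    """Rank documents separately within each sampled run, best first."""
    samples = np.asarray(score_samples, dtype=float)
    return np.argsort(-samples, axis=1, kind="stable")

if __name__ == "__main__":
    runs = [[0.9, 0.2, 0.5],
            [0.7, 0.6, 0.4]]
    print(rank_from_samples(runs))     # one consensus ranking
    print(per_sample_rankings(runs))   # rankings may differ between runs
```

The per-run rankings are the kind of varied sequences a diversity analysis would inspect, while the consensus ranking is what a single-list metric would score.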
Referee: [§4 (Experiments)] No details are provided on the diffusion schedule, the exact form of the denoising network, how discrete labels are mapped into the continuous diffusion process, or any error analysis (e.g., KL divergence to ground-truth ranking distributions). Without these, it is impossible to verify whether the reported effectiveness stems from the generative formulation or from standard LTR components.

Authors: We acknowledge that the experimental section lacks these implementation details. The revised manuscript will expand §4 with the exact diffusion schedule (linear β from 1e-4 to 0.02 over 1000 steps), the denoising network architecture (3-layer MLP with 256 hidden units conditioned on concatenated query-document embeddings), the label mapping procedure (direct scaling of discrete relevance to [0,1]), and quantitative error analysis including KL divergence to empirical ranking distributions where multiple annotations exist. We will also add ablation experiments isolating the generative component from standard LTR baselines.

Revision: yes
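The schedule and label mapping named in this response can be sketched as follows. The linear β range, the 1000 steps and the scaling to [0,1] come from the response above. The closed-form noising step is the standard DDPM one, and the maximum label of 4 is an assumption based on the referee's remark that labels are typically integers 0–4. The denoising MLP is omitted.

```python
import numpy as np

T = 1000                                    # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)          # linear beta schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)             # cumulative product of alpha_s up to step t

def scale_labels(labels, max_label=4):
    """Map integer relevance labels to [0, 1] (max_label=4 is an assumption)."""
    return np.asarray(labels, dtype=float) / max_label

def noise_labels(y0, t, rng):
    """Sample Y_t from q(Y_t | Y_0) for a 1-indexed step t (standard DDPM form)."""
    a_bar = alpha_bars[t - 1]
    eps = rng.standard_normal(y0.shape)     # Gaussian noise, one value per document
    return np.sqrt(a_bar) * y0 + np.sqrt(1.0 - a_bar) * eps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y0 = scale_labels([0, 2, 4, 1])
    for step in (1, 500, 1000):
        print(step, noise_labels(y0, step, rng))
```

With this schedule the cumulative product decays to nearly zero by the final step, so the last noised vector is close to pure Gaussian noise regardless of the original labels.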
Referee: [§3.2 (Reverse Process)] The training objective is not shown to be equivalent to maximizing the likelihood of the true ranking distribution; if the denoiser is trained only on a simplified DDPM-style loss, the recovered samples may not correspond to valid permutations or scores for arbitrary queries.

Authors: The objective follows the simplified DDPM loss, which is a variational lower bound rather than exact likelihood maximization. In the revision we will explicitly derive its relation to the conditional ELBO and show that the denoised outputs are valid score distributions (non-negative and summable to one after normalization). We will note that exact permutation sampling is not guaranteed and that rankings are derived from expected scores; this approximation will be discussed as a limitation with supporting empirical checks on validity.

Revision: partial
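A minimal sketch of the two pieces mentioned in this response: the squared-error objective on the predicted clean labels, and a normalization that turns raw denoised outputs into a non-negative vector summing to one. The response does not say how normalization is done, so the clip-and-renormalize step below is an assumed choice, and both function names are illustrative.

```python
import numpy as np

def x0_prediction_loss(y0, y0_hat):
    """Squared error between clean labels Y_0 and the denoiser's prediction, for one query."""
    y0 = np.asarray(y0, dtype=float)
    y0_hat = np.asarray(y0_hat, dtype=float)
    return float(np.sum((y0 - y0_hat) ** 2))

def to_score_distribution(y0_hat):
    """Clip negatives to zero, then rescale so the scores sum to one (assumed procedure)."""
    clipped = np.clip(np.asarray(y0_hat, dtype=float), 0.0, None)
    total = clipped.sum()
    if total == 0.0:                        # degenerate case: fall back to uniform
        return np.full(clipped.shape, 1.0 / clipped.size)
    return clipped / total

if __name__ == "__main__":
    print(x0_prediction_loss([1.0, 0.0], [0.5, 0.5]))
    print(to_score_distribution([-0.2, 0.3, 0.9]))
```

Clipping keeps the relative order of the positive scores, so the ranking obtained by sorting is unchanged for documents with positive predictions.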

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external LTR benchmarks

The paper introduces DenoiseRank as a generative diffusion approach to LTR by forward-noising relevance labels and reverse-denoising conditioned on query-document features. No equations, fitted parameters, or self-citations are exhibited that reduce any claimed prediction (e.g., recovered ranking distribution) to an input by construction. The central premise applies standard diffusion machinery to a new task domain without renaming known results, importing uniqueness theorems from the same authors, or smuggling ansatzes via prior self-citation. The derivation therefore stands as an independent modeling choice whose validity is to be judged by empirical performance on benchmark datasets rather than by internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5420 in / 979 out tokens · 15467 ms · 2026-05-15T22:12:18.291301+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We propose a novel denoise rank model, DenoiseRank, which noises the relevant labels in the diffusion process and denoises them on the query documents in the reverse process..." $\mathcal{L} = \mathbb{E}_{t,\, Y_0,\, p_\theta}\left[\lVert Y_0 - p_\theta(D, Y_t, t) \rVert^2\right]$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Our model is the first to address traditional LTR from generative perspective and is a diffusion method for LTR."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages

[1] Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96. Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning, 11(23-581):81. Maarten Buyl, Paul Missault, and Pierre-Antoine Sondag. 2023. Rankformer: Listwise learning-to-rank using listwide labe...
[2] Card: Classification and regression diffusion models. Advances in Neural Information Processing Systems, 35:18100–18115. Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, an...
[3] Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30. Danupat Khamnuansin, Tawunrat Chalothorn, and Ekapol Chuangsuwanich. 2024. Mrrank: Improving question answering retrieval system through multi-result ranking model. arXiv preprint arXiv:2406.05733. Diederik P Kingma, Max Welling, and 1 ot...
[4] Wasserstein generative learning of conditional distribution. arXiv preprint arXiv:2112.10039. Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, and Alberto Veneri. 2025. Explainable, effective, and efficient learning-to-rank models using ilmart. ACM Transactions on Information Systems. Dan Luo, Lixin Zou, Qingyao Ai, Zhiyu Chen...
[5] Lit and lean: Distilling listwise rerankers into encoder-decoder models. In European Conference on Information Retrieval, pages 156–164. Springer. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30....
[6] A deep generative approach to conditional sampling. Journal of the American Statistical Association, 118(543):1837–1848. A Motivation of Adopting Diffusion Models. A.1 Weakness of traditional LTR algorithms. • Given an input query, existing LTR algorithms tend to produce a ranked list of documents that is consistent over time and lacks diversity. • T...
[7] A learning rate of $10^{-3}$ is optimal for training DenoiseRank on the MS Web30K dataset, with $10^{-4}$ being the next best option.
[8] For the Yahoo! and Istella datasets, $10^{-4}$ is the better learning rate with which to train DenoiseRank; $10^{-3}$ gives a comparable result.
[9] In most situations, learning rates of $10^{-1}$ and $10^{-2}$ result in poor performance, which suggests that DenoiseRank needs careful optimisation. Convergence: DenoiseRank is a new LTR model that approaches the task from a generative perspective, combined with a diffusion model, which needs many timesteps in the diffusion and reverse processes. Thus we investigate the c...
[10] On the MS Web30K dataset, both DenoiseRank and Rankformer converge after 50 epochs of training.
[11] On the Yahoo! dataset, DenoiseRank converges after 130 epochs, while Rankformer is slower and converges after 200 epochs.
[12] We speculate it is because: first, documents in Yahoo! have higher-dimensional features (700 dimensions per document) than those in MS Web30K (136 dimensions per document), so the model needs more epochs to fit them; second, DenoiseRank addresses the LTR task from a generative perspective and, combined with a diffusion model, it can fit high-dimensional features more ... We have proposed two hypotheses: (1) DenoiseRank demonstrates greater robustness for sparse data (with few effective features)
[13] On train.txt, YAHOO's NOEF is 224, ISTELLA's NOEF is 115, and WEB30K's NOEF is 85. Effective feature count ranking: YAHOO > ISTELLA > WEB30K.
[14] On Web30K, DenoiseRank performs excellently in versions with fewer effective features, while the other two models show little difference.
[15] On YAHOO, DenoiseRank and DASALC perform excellently in versions with fewer effective features, while GBM shows little difference.
[16] On ISTELLA, DenoiseRank performs slightly better in versions with fewer effective features but overall performs poorly. These results suggest that DenoiseRank, being diffusion-based, demonstrates superior learning capabilities for distributions and robustness on sparse features compared to other models. Consequently, it exhibits advantages on WEB30K (wi...
[17] The query-document length in Web30K exhibits a central tendency around 110, following a normal-like distribution, and displays characteristics of a long-tail distribution (the maximum length is nearly 1300). In contrast, query-document length in Istella presents a hump distribution (max length < 190), and that in Yahoo gradually decreases between 1 and 120 (max length < 140).
[18] On the WEB30K dataset, DenoiseRank outperformed the other two models in the medium-length range (50–250), while the difference was negligible in the long-tail range (>250).
[19] On the YAHOO dataset, DenoiseRank performed similarly to LightGBM in the short range (<50) but underperformed compared to LightGBM in the medium-long range (>50), while DASALC showed poorer performance.
[20] On the ISTELLA dataset, DenoiseRank underperforms LightGBM across most intervals and slightly trails DASALC in certain ranges (130–160), particularly when @K is reduced. Experimental results indicate that DenoiseRank achieves optimal performance on medium-to-long intervals with sufficient training resources (e.g., its performance on Web30K), while its a...
[21] The model performs better as the maximum time step increases, suggesting that slow noise addition is more beneficial for model learning.
[22] Model performance depends more on long time steps on the Web30K dataset.
[23] The performance of the model is not always optimal for long time steps. The model performs optimally on the Istella dataset at T=
[24] This means that we can reduce the time step appropriately to speed up training and inference. E.2 Noise Scheduler. The noise scheduler is the way in which $\bar{\alpha}_t$ changes during diffusion, where $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$, see eq. 4. The rate of change of $\bar{\alpha}_t$ varies in different noise-adding schemes, e.g., truncated linear has a large change before $T/2$ and a small cha...
[25] TruncatedLinear performs better than the other schedules overall, but there is not a big difference.
[26] The performance of the different noise schedules varies greatly on the Web30K dataset, i.e. TruncatedLinear > Sqrt > Linear > Cosine.
[27] On the Yahoo and Istella datasets, there is not much difference in the reliability of the ranking, and on the Istella dataset, Sqrt even performs slightly better than TruncatedLinear. E.3 The Number of Denoise Network Layers. As shown in Figure 1 on the right, the denoising network of DenoiseRank is a feed-forward architecture. The input and output ...
[28] The number of layers makes a significant difference to model performance.
[29] On the Web30K dataset, layers=2 performs the best, followed by layers=4, and performance decreases as the number of layers increases.
[30] On the Yahoo dataset, the model performs significantly better with 2 and 4 layers than with 6 and 8.
[31] On the Istella dataset, the number of layers has no significant effect on model performance. E.4 Self Attentions. In recent studies on learning-to-rank (Pang et al., 2020; Qin et al., 2021; Buyl et al., 2023), the self-attention mechanism has been shown to significantly improve ranking results. To evaluate the effectiveness of self-attention (SA) in D...
[32] In real-world information retrieval, diverse ranked lists of items in different search scenarios can be meaningful for at least 3 reasons:
[33] Traditional LTR tends to rank consistently, which homogenizes the ranking results and traps users in an information cocoon.
[34] Tapping into the 'long-tail ecosystem'. This is a way to boost premium content exposure.
[35] Traditional LTR tends to fall into local optima; DenoiseRank can provide diverse ranking results that may be more accurate. For example:
[36] In self-media communities, diverse ranking can surface premium long-tail creative content for users, which can encourage new creators.
[37] In shopping retrieval on an e-commerce website, we want items with the same relevance scores to have a fair chance to rank higher. Unfortunately, previous LTR models did not consider uncertainty for ranking and may not rank items diversely. In this study, we denote diversity in LTR task as: given a query Q and the corresponding documents D, run infere...
[38] Among 10 inference runs, the RSD is 0.11, 0.16, 0.28 and 0.64 in the top 1, 5, 10 and 20 positions … Table 8: NDCG@K performance with different denoise network depths on Microsoft Web30K, Yahoo!, and Istella datasets. Best performance per column in bold. Columns: Layers; Web30K @1, @5, @10; Yahoo! @1, @5, @10; Istella @1, @5, @10. Row for 2 layers: 51.87, 52.52, 54.60; 71.37, 74.06, 78.42; 69.00, 69.10, 75.69...
[39] NDCG@K performance remains excellent and even slightly increases after repeated inference, which means that DenoiseRank can produce diverse ranked lists while preserving the reliability of the ranking result.
[40] Rankformer did not show the ability to rank in different orders; its RSD is 0.1 regardless of the position K. This supports our conjecture that traditional LTR models do not inject uncertainty, which results in a static ranking sequence.
[41] Although the NDCG remains excellent on average while enhancing diversity, the deviation results also indicate that isolated extreme cases may occur, leading to either low or high NDCG@K. According to the above analysis, DenoiseRank can be applied to areas requiring diverse ranking sequences of items. Our novel metric, RSD, can also be used to ...
[42] RMSE, a typical point-wise loss: $L_{\mathrm{RMSE}}(Y, \hat{Y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}$
[43] RankNet (Burges et al., 2005), a classic pair-wise loss: $L_{\mathrm{RankNet}}(Y, \hat{Y}) = \sum_{Y_i > Y_j} \log_e\left(1 + e^{\hat{Y}_j - \hat{Y}_i}\right)$
[44] NDCGLoss2++ (Wang et al., 2018), an NDCG metric-driven loss function based on the LambdaLoss probabilistic framework: $L_{\mathrm{NDCGLoss2++}}(Y, \hat{Y}) = -\sum_{Y_i > Y_j} \log_2 \sum_{\pi} \left(\frac{1}{1 + e^{-\sigma(\hat{Y}_i - \hat{Y}_j)}}\right)^{(\rho_{ij} + \mu\delta_{ij})\,|G_i - G_j|} H(\pi \mid \hat{Y})$, where $G_i = \frac{2^{y_i} - 1}{\mathrm{maxDCG}}$, $\rho_{ij} = \left|\frac{1}{D_i} - \frac{1}{D_j}\right|$, $\delta_{ij} = \left|\frac{1}{D_{|i-j|}} - \frac{1}{D_{|i-j|+1}}\right|$, $D_i = \log_2(1 + i)$, and $H(\pi \mid \hat{Y})$ is a hard assignment distributi...
[45] ApproxNDCG (Qin et al., 2010; Bruch et al., 2019), a loss designed to approximate the NDCG metric: $L_{\mathrm{ApproxNDCG}}(Y, \hat{Y}) = \frac{1}{Z}\sum_{i=1}^{n} \frac{G(Y_i)}{\log_2(1 + \pi(i))}$, where $Z = -\mathrm{DCG}(\pi^{*}, Y)$, $G(Y_i) = 2^{Y_i} - 1$, $\pi(i) = \frac{1}{2} + \sum_{j} \mathrm{sigmoid}\left(\frac{\hat{Y}_j - \hat{Y}_i}{T}\right)$, and $T$ is a smoothing parameter.
[46] ListNet (Cao et al., 2007), a classic list-wise loss: $L_{\mathrm{ListNet}}(Y, \hat{Y}) = -\sum_{i=1}^{n} Y_i \log_e \frac{e^{\hat{Y}_i}}{\sum_{j} e^{\hat{Y}_j}}$
[47] MSE (Ho et al., 2020; Nichol and Dhariwal, 2021), a loss function used in DDPMs to predict $x_0$ or $\epsilon$; here we formulate it as $L_{\mathrm{MSE}}(Y, \hat{Y}) = \mathbb{E}\left[\lVert Y - \hat{Y} \rVert^2\right]$. We report the results based on the best NDCG@10 for different losses. For different loss functions, we use the AdamW optimizer and scan learning rates in {0.01, 0.001, 0.0001}. We try to find the best performance of ...
[48]

DenoiseRank, when trained with MSE, RMSE and ListNet, achieves first-tier performance and is far superior to the rest

work page
[49]

Though ApproxNDCG improves the perfor- mance of neural LTR models in the original papers, it does not seem to work well on De- noisRank, which is implemented from a gen- erative perspective

work page
[50]

How- ever, for the Yahoo! and Istella datasets, train- ing with MSE loss is the best choice

DenoiseRank, when trained with ListNet, per- forms the best on the Web30K dataset. How- ever, for the Yahoo! and Istella datasets, train- ing with MSE loss is the best choice. H Other Metrics In order to evaluate our denoiseRank fully, we use another 4 types of ranking metrics, including Ex- pected Reciprocal Rank (ERR), Mean Average Pre- cision (MAP), Me...

work page
[51]

As shown in Table 15, the inference time of the DenoiseRank with a smaller reverse step 23 Table 10: NDCG@K and RSD@(K,M) performance of DenoiseRank and Rankformer on Microsoft Web30K datasets. Model M RSD@(K,M) NDCG@K 1 5 10 20 1 5 10 20 Rankformer 1 – – – – 49.62 49.30 51.42 54.29 10 0.1 0.1 0.1 0.1 49.62 49.30 51.42 54.29 DenoiseRank 1 – – – – 51.48 52...

work page arXiv
[52]

Therefore, it is necessary to strike a balance between time and performance and to use the appropriate time step in practice

According to Table 13, DenoiseRank’s inference time increases quickly as the reverse step increases; however, NDCG per- formance fluctuates(Table 14). Therefore, it is necessary to strike a balance between time and performance and to use the appropriate time step in practice

work page
[53]

Compared to neural network-based models, tree-based models require very little inference time owing to their lower computational complexity and algorithmic properties. Therefore, tree-based models should be used in systems requiring a high response speed, and DenoiseRank should be used in scenarios involving complex representations and user feedback distributions

work page
[54]

The training time of the tree-based models is significantly lower than that of the neural network-based models, owing to their more lightweight model structure and smaller size

In Table 15, we report the training time of the baselines. The training time of the tree-based models is significantly lower than that of the neural network-based models, owing to their more lightweight model structure and smaller size. Additionally, DenoiseRank requires a moderate training time among the neural network models. Although the forward proces...

work page

[1] [1]

In Proceedings of the 22nd international conference on Machine learning, pages 89–96

Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96.
Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning, 11(23-581):81.
Maarten Buyl, Paul Missault, and Pierre-Antoine Sondag. 2023. Rankformer: Listwise learning-to-rank using listwide labe...

work page 2010

[2] [2]

Jonathan Ho, Ajay Jain, and Pieter Abbeel

Card: Classification and regression diffusion models. Advances in Neural Information Processing Systems, 35:18100–18115.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851.
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, an...

work page 2020

[3] [3]

Danupat Khamnuansin, Tawunrat Chalothorn, and Ekapol Chuangsuwanich

Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30.
Danupat Khamnuansin, Tawunrat Chalothorn, and Ekapol Chuangsuwanich. 2024. Mrrank: Improving question answering retrieval system through multi-result ranking model. arXiv preprint arXiv:2406.05733.
Diederik P Kingma, Max Welling, and 1 ot...

work page arXiv 2024

[4] [4]

Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, and Alberto Veneri

Wasserstein generative learning of conditional distribution. arXiv preprint arXiv:2112.10039.
Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, and Alberto Veneri. 2025. Explainable, effective, and efficient learning-to-rank models using ilmart. ACM Transactions on Information Systems.
Dan Luo, Lixin Zou, Qingyao Ai, Zhiyu Chen...

work page arXiv 2025

[5] [5]

In European Conference on Information Retrieval, pages 156–164

Lit and lean: Distilling listwise rerankers into encoder-decoder models. In European Conference on Information Retrieval, pages 156–164. Springer.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30....

work page arXiv 2017

[6] [6]

A deep generative approach to conditional sampling. Journal of the American Statistical Association, 118(543):1837–1848.

A Motivation of Adopting Diffusion Models

A.1 Weakness of traditional LTR algorithms

• Given an input query, existing LTR algorithms tend to produce a ranked list of documents that is consistent over time and lacks diversity.
• T...

work page 2017

[7] [7]

A learning rate of $10^{-3}$ is optimal for training DenoiseRank on the MS Web30K dataset, with $10^{-4}$ being the next best option

work page

[8] [8]

For the Yahoo! and Istella datasets, $10^{-4}$ is the better learning rate for training DenoiseRank; $10^{-3}$ gives a comparable result

work page

[9] [9]

In most situations, learning rates of $10^{-1}$ and $10^{-2}$ result in poor performance, which suggests that DenoiseRank needs careful optimisation.

Convergence. DenoiseRank is a new LTR model that approaches the task from a generative perspective and is combined with a diffusion model, which needs many timesteps in the diffusion and reverse processes. Thus we investigate the c...

work page

[10] [10]

On the MS Web30K dataset, both DenoiseRank and Rankformer converge after 50 epochs of training

work page

[11] [11]

On the Yahoo! dataset, DenoiseRank converges after 130 epochs, while Rankformer is slower and converges after 200 epochs

work page

[12] [12]

We have proposed two hypotheses: (1) DenoiseRank demonstrates greater robustness for sparse data (with few effective features)

We speculate that this is because: first, documents in Yahoo! have higher-dimensional features (700 dimensions per document) than those in MS Web30K (136 dimensions per document), so the model needs more epochs to fit them; second, DenoiseRank addresses the LTR task from a generative perspective and is combined with a diffusion model, so it can fit high-dimensional features more ...

work page

[13] [13]

Effective feature count ranking: YAHOO > ISTELLA > WEB30K

On train.txt, YAHOO’s NOEF is 224, ISTELLA’s NOEF is 115, and WEB30K’s NOEF is 85. Effective feature count ranking: YAHOO > ISTELLA > WEB30K

work page

[14] [14]

On Web30K, DenoiseRank performs excellently in versions with fewer effective features, while the other two models show little difference

work page

[15] [15]

On YAHOO, DenoiseRank and DASALC perform excellently in versions with fewer effective features, while GBM shows little difference

work page

[16] [16]

On ISTELLA, DenoiseRank performs slightly better in versions with fewer effective features but overall performs poorly. These results suggest that DenoiseRank, being diffusion-based, demonstrates superior learning capabilities for distributions and robustness on sparse features compared to other models. Consequently, it exhibits advantages on WEB30K (wi...

work page

[17] [17]

In contrast, the query-document length in Istella presents a hump distribution (max length < 190), and that in Yahoo gradually decreases between 1 and 120 (max length < 140)

The query-document length in Web30K exhibits a central tendency around 110, following a normal-like distribution, and displays the characteristics of a long-tail distribution (the maximum length is nearly 1300). In contrast, the query-document length in Istella presents a hump distribution (max length < 190), and that in Yahoo gradually decreases betwee...

work page

[18] [18]

On the WEB30K dataset, DenoiseRank outperformed the other two models in the medium-length range (50–250), while the difference was negligible in the long-tail range (>250)

work page

[19] [19]

On the YAHOO dataset, DenoiseRank performed similarly to LightGBM in the short range (<50) but underperformed compared to LightGBM in the medium-long range (>50), while DASALC showed poorer performance

work page

[20] [20]

On the ISTELLA dataset, DenoiseRank underperforms LightGBM across most intervals and slightly trails DASALC in certain ranges (130–160), particularly when @K is reduced. Experimental results indicate that DenoiseRank achieves optimal performance on medium-to-long intervals with sufficient training resources (e.g., its performance on Web30K), while its a...

work page 2020

[21] [21]

The model performs better as the maximum time step increases, suggesting that slow noise addition is more beneficial for model learning

work page

[22] [22]

The model performance is more dependent on long time steps on the web30k dataset

work page

[23] [23]

The model performs optimally on the Istella dataset at T=

The performance of the model is not always optimal for long time steps. The model performs optimally on the Istella dataset at T=

work page

[24] [24]

E.2 Noise Scheduler The noise scheduler is the way in which $\bar{\alpha}_t$ changes during diffusion, where $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$, see eq

This means that we can reduce the time step appropriately to speed up training and inference.

E.2 Noise Scheduler

The noise scheduler is the way in which $\bar{\alpha}_t$ changes during diffusion, where $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$, see eq. 4. The rate of change of $\bar{\alpha}_t$ varies in different noise-adding schemes, e.g., truncated linear has a large change before $T/2$ and a small cha...

work page 2023
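To make the cumulative product $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ concrete, here is a minimal sketch of a plain linear schedule and the corresponding forward noising of a label vector. It assumes the standard DDPM parameterisation $\alpha_t = 1 - \beta_t$ and $y_t = \sqrt{\bar{\alpha}_t}\, y_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$; the endpoints `beta_start` and `beta_end` are the common DDPM defaults, not values taken from this paper, and the function names are hypothetical. The TruncatedLinear, Sqrt and Cosine schedules compared here are not reproduced.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (assumed endpoints, common DDPM defaults)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * t for t in range(T)]

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(y0, alpha_bar_t, rng):
    """Sample y_t = sqrt(alpha_bar_t) * y0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    signal, noise = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [signal * y + noise * rng.gauss(0.0, 1.0) for y in y0]

abars = alpha_bars(linear_betas(T=1000))
rng = random.Random(0)
y_early = add_noise([3.0, 1.0, 0.0], abars[0], rng)    # nearly clean labels
y_late = add_noise([3.0, 1.0, 0.0], abars[-1], rng)    # almost pure noise
```

Because $\bar{\alpha}_t$ decreases monotonically, the label signal fades as $t$ grows; a schedule only changes how quickly that happens at each stage.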

[25] [25]

TruncatedLinear performs better than the other schedules overall, but there is not a big difference

work page

[26] [26]

TruncatedLinear > Sqrt > Linear > Cosine

The performance of the different noise schedules varies greatly on the Web30K dataset, i.e. TruncatedLinear > Sqrt > Linear > Cosine

work page

[27] [27]

E.3 The Number of Denoise Network Layers As shown in Figure

On the Yahoo and Istella datasets, there is not much difference in the reliability of the ranking, and on the Istella dataset, Sqrt even performs slightly better than TruncatedLinear.

E.3 The Number of Denoise Network Layers

As shown in Figure 1 on the right, the denoising network of DenoiseRank is a feed-forward architecture. The input and output ...

work page 2022

[28] [28]

The number of layers has a significant effect on model performance

work page

[29] [29]

On the Web30K dataset, layers=2 performs the best, followed by layers=4, and the performance decreases as the number of layers increases

work page

[30] [30]

On the Yahoo dataset, the model performs significantly better with 2 or 4 layers than with 6 or 8

work page

[31] [31]

On the Istella dataset, the number of layers has no significant effect on model performance.

E.4 Self Attentions

In recent studies on learning-to-rank (Pang et al., 2020; Qin et al., 2021; Buyl et al., 2023), the self-attention mechanism has been shown to significantly improve ranking results. To evaluate the effectiveness of self-attention (SA) in D...

work page arXiv 2020

[32] [32]

In real-world information retrieval, a diverse ranked list of items in different search scenarios can be meaningful for at least 3 reasons:

work page

[33] [33]

Traditional LTR tends to rank consistently, which homogenizes the ranking results and traps users in an information cocoon

work page

[34] [34]

This is a way to boost premium content exposure

Tapping into the ‘long-tail ecosystem’. This is a way to boost premium content exposure

work page

[35] [35]

For example:

Traditional LTR tends to fall into local optima, whereas DenoiseRank can provide diverse ranking results that may be more accurate. For example:

work page

[36] [36]

In self-media communities, diverse ranking can surface premium long-tail creative content for users, which can encourage new creators

work page

[37] [37]

Unfortunately, previous LTR models did not consider uncertainty for ranking and may not rank items diversely

For shopping retrieval on an e-commerce website, we want items with the same relevance scores to have a fair chance to rank higher. Unfortunately, previous LTR models did not consider uncertainty for ranking and may not rank items diversely. In this study, we denote diversity in the LTR task as: given a query Q and the corresponding documents D, run infere...

work page

[38] [38]

Best performance per column in bold

Among 10 inference runs, the RSD is 0.11, 0.16, 0.28, 0.64 in the top 1, 5, 10, 20 posi-

Table 8: NDCG@K performance with different denoise network depths on Microsoft Web30K, Yahoo!, and Istella datasets. Best performance per column in bold.

Layers   Web30K                 Yahoo!                 Istella
         @1     @5     @10      @1     @5     @10      @1     @5     @10
2        51.87  52.52  54.60    71.37  74.06  78.42    69.00  69.10  75.69...

work page

[39] [39]

NDCG@K performance remains excellent and even slightly increases after repeated inference, which means that DenoiseRank can produce diverse ranked lists while guaranteeing the reliability of the ranking results

work page

[40] [40]

This supports our conjecture that traditional LTR models do not inject uncertainty, which results in a static ranking sequence

Rankformer did not show the ability to rank in different orders; its RSD is 0.1 regardless of the position K. This supports our conjecture that traditional LTR models do not inject uncertainty, which results in a static ranking sequence

work page

[41] [41]

According to the above analysis, DenoiseRank can be applied to areas requiring diverse ranking sequences of items

Although the NDCG remains excellent on average while enhancing diversity, the deviation results also indicate that isolated extreme cases may occur, leading to either low or high NDCG@K. According to the above analysis, DenoiseRank can be applied to areas requiring diverse ranking sequences of items. Our novel metric, RSD, can also be used to ...

work page

[42] [42]

RMSE: a typical point-wise loss: $L_{\mathrm{RMSE}}(Y, \hat{Y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}$

work page

[43] [43]

RankNet (Burges et al., 2005): a classic pair-wise loss: $L_{\mathrm{RankNet}}(Y, \hat{Y}) = \sum_{Y_i > Y_j} \log_e\left(1 + e^{\hat{Y}_j - \hat{Y}_i}\right)$

work page 2005
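As an illustration of the pair-wise RankNet loss above, the following sketch sums $\log_e(1 + e^{\hat{Y}_j - \hat{Y}_i})$ over every pair with $Y_i > Y_j$ for a single query. The function name is hypothetical and the plain double loop is chosen for readability, not speed; this is not the authors' training code.

```python
import math

def ranknet_loss(y, y_hat):
    """Sum over pairs (i, j) with y[i] > y[j] of log(1 + exp(y_hat[j] - y_hat[i]))."""
    total = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                # The penalty grows when the less relevant document j outscores i.
                total += math.log1p(math.exp(y_hat[j] - y_hat[i]))
    return total

labels = [2, 1, 0]
well_ordered = ranknet_loss(labels, [3.0, 2.0, 1.0])   # scores agree with labels
mis_ordered = ranknet_loss(labels, [1.0, 2.0, 3.0])    # scores invert the labels
```

Pairs with equal labels contribute nothing, so a query whose documents all share one label yields a loss of zero.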

[44] [44]

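NDCGLoss2++ (Wang et al., 2018): an NDCG metric-driven loss function based on the LambdaLoss probabilistic framework: $L_{\mathrm{NDCGLoss2++}}(Y, \hat{Y}) = -\sum_{Y_i > Y_j} \log_2 \sum_{\pi} \left(\frac{1}{1 + e^{-\sigma(\hat{Y}_i - \hat{Y}_j)}}\right)^{(\rho_{ij} + \mu\delta_{ij})\,|G_i - G_j|} H(\pi \mid \hat{Y})$, where $G_i = \frac{2^{y_i} - 1}{\mathrm{maxDCG}}$, $\rho_{ij} = \left|\frac{1}{D_i} - \frac{1}{D_j}\right|$, $\delta_{ij} = \left|\frac{1}{D_{|i-j|}} - \frac{1}{D_{|i-j|+1}}\right|$, $D_i = \log_2(1 + i)$, and $H(\pi \mid \hat{Y})$ is a hard assignment distributi...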

work page 2018

[45] [45]

ApproxNDCG (Qin et al., 2010; Bruch et al., 2019): a loss designed to approximate the NDCG metric: $L_{\mathrm{ApproxNDCG}}(Y, \hat{Y}) = \frac{1}{Z}\sum_{i=1}^{n} \frac{G(Y_i)}{\log_2(1 + \pi(i))}$, where $Z = -\mathrm{DCG}(\pi^{*}, Y)$, $G(Y_i) = 2^{Y_i} - 1$, $\pi(i) = \frac{1}{2} + \sum_{j} \mathrm{sigmoid}\left(\frac{\hat{Y}_j - \hat{Y}_i}{T}\right)$, and $T$ is a smoothing parameter

work page 2010
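The ApproxNDCG definition above can be sketched for one query as follows. The code follows the formula as written: the sum over $j$ includes $j = i$, whose sigmoid term contributes $1/2$, so a document that clearly outscores all others gets a smoothed rank close to 1. The function names and the default temperature are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def approx_ndcg_loss(y, y_hat, temperature=1.0):
    """Negative smoothed NDCG: (1/Z) * sum_i G(y_i) / log2(1 + pi(i)), with Z = -ideal DCG."""
    n = len(y)
    gains = [2 ** yi - 1 for yi in y]
    ideal_dcg = sum(g / math.log2(1 + rank)
                    for rank, g in enumerate(sorted(gains, reverse=True), start=1))
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: the loss is defined as zero here (assumption)
    total = 0.0
    for i in range(n):
        # Smoothed rank of document i: 1/2 plus one sigmoid term per document j.
        approx_rank = 0.5 + sum(sigmoid((y_hat[j] - y_hat[i]) / temperature)
                                for j in range(n))
        total += gains[i] / math.log2(1 + approx_rank)
    return total / -ideal_dcg

labels = [2, 1, 0]
sharp_good = approx_ndcg_loss(labels, [3.0, 2.0, 1.0], temperature=0.01)
sharp_bad = approx_ndcg_loss(labels, [1.0, 2.0, 3.0], temperature=0.01)
```

With a small temperature the smoothed ranks approach the true ranks, so `sharp_good` approaches -1 (a perfect NDCG, negated); larger temperatures give smoother gradients at the cost of a looser approximation.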

[46] [46]

ListNet (Cao et al., 2007): a classic list-wise loss: $L_{\mathrm{ListNet}}(Y, \hat{Y}) = -\sum_{i=1}^{n} Y_i \log_e \frac{e^{\hat{Y}_i}}{\sum_{j} e^{\hat{Y}_j}}$

work page 2007
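The list-wise ListNet loss above is a cross-entropy between the labels and the softmax of the predicted scores. A minimal sketch for one query follows; it uses the raw labels $Y_i$ as weights, exactly as in the formula above, and computes the log-softmax with the usual max-subtraction trick for numerical stability. The function name is hypothetical.

```python
import math

def listnet_loss(y, y_hat):
    """-sum_i y[i] * log(softmax(y_hat)[i]), computed via a stable log-sum-exp."""
    m = max(y_hat)
    log_norm = m + math.log(sum(math.exp(s - m) for s in y_hat))
    return -sum(yi * (si - log_norm) for yi, si in zip(y, y_hat))

labels = [2, 1, 0]
well_ordered = listnet_loss(labels, [3.0, 2.0, 1.0])   # scores agree with labels
mis_ordered = listnet_loss(labels, [1.0, 2.0, 3.0])    # scores invert the labels
```

Unlike the pair-wise loss, this one looks at the whole score list at once, which is why it is grouped with the list-wise objectives.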
