Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling
Pith reviewed 2026-06-30 11:18 UTC · model grok-4.3
The pith
Surprise-Guided MergeSort skips up to 535 human comparisons per session while raising Kendall's τ×100 by 6 to 12 points over Active Elo under the same total budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Surprise-Guided MergeSort integrates a bottom-up MergeSort scheduler, a composite Surprise Scorer, and an adaptive budget allocator that sends only high-surprise pairs to humans and automates the rest via transitivity inference, yielding higher Kendall tau under fixed annotation budgets on STS-B, BIOSSES, SICKR-STS, KonIQ-10k, TID2013, and LIVE Challenge.
What carries the argument
The composite Surprise Scorer, which combines position-bias-cancelled VLM confidence, Elo gap, and vote entropy to measure comparison ambiguity and decide human versus automated routing.
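A minimal sketch of how such a composite score could be assembled. The abstract names the three signals but not how they are combined, so everything here is an assumption: the equal weights, the averaging over both presentation orders as the position-bias cancellation, the logistic Elo expectation with scale 400, and all function names are hypothetical and not taken from the paper.

```python
import math

def debiased_vlm_pref(p_a_first: float, p_a_second: float) -> float:
    """Average P(VLM picks A) over both presentation orders (assumed debiasing)."""
    return 0.5 * (p_a_first + p_a_second)

def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a Bernoulli(p) outcome; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def elo_closeness(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """1 when ratings are equal, approaching 0 as the Elo gap grows."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))
    return 1.0 - abs(2.0 * expected_a - 1.0)

def surprise(p_a_first, p_a_second, r_a, r_b, votes_a, votes_b,
             weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical composite score in [0, 1]; higher means more ambiguous."""
    p = debiased_vlm_pref(p_a_first, p_a_second)
    vlm_ambiguity = 1.0 - abs(2.0 * p - 1.0)
    total = votes_a + votes_b
    # With no votes yet, treat the pair as maximally uncertain (assumption).
    vote_ent = binary_entropy(votes_a / total) if total else 1.0
    w_vlm, w_elo, w_ent = weights
    return w_vlm * vlm_ambiguity + w_elo * elo_closeness(r_a, r_b) + w_ent * vote_ent
```

Under these assumptions, a pair with a 50/50 VLM preference, equal ratings, and split votes scores 1, while a pair with a unanimous VLM preference, a 1000-point Elo gap, and unanimous votes scores close to 0. A router would then send pairs above some threshold to a human.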
If this is right
- Up to 535 non-informative comparisons can be skipped per session without human input.
- Kendall's τ×100 improves by +6 to +12 compared to Active Elo under the same total budget.
- The accuracy-efficiency trade-off holds across both text similarity and image quality assessment domains.
- VLM-guided surprise metrics plus sorting structure outperform prior active comparison schedulers on the tested benchmarks.
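To make the τ×100 scale concrete, here is a small sketch of Kendall's tau computed from concordant and discordant pairs. It assumes the tau-a variant with no tied scores; the page does not say which variant the paper uses, and the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau_x100(true_scores, predicted_scores):
    """Kendall tau-a scaled by 100; assumes no tied scores in either list."""
    n = len(true_scores)
    if n != len(predicted_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (true_scores[i] - true_scores[j]) * (
            predicted_scores[i] - predicted_scores[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = n * (n - 1) // 2
    return 100.0 * (concordant - discordant) / pairs
```

With n items there are n(n-1)/2 pairs, so flipping a single pair from concordant to discordant moves τ×100 by 200 divided by the number of pairs. For four items that is about 33 points; for larger sets a 6 to 12 point gain corresponds to many corrected pairs.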
Where Pith is reading between the lines
- The same surprise-based routing could be tested on preference data collection for language model alignment.
- Replacing the VLM component with a domain-specific model might extend the approach beyond vision-language tasks.
- Scaling experiments on datasets larger than the current six benchmarks would show whether skipped-comparison counts grow with n.
Load-bearing premise
The composite Surprise Scorer reliably identifies comparisons whose outcome can be safely inferred by transitivity without introducing ranking errors.
What would settle it
Run SGS to completion on a dataset with complete ground-truth rankings, then count how many inferred comparisons disagree with the true order and whether those disagreements lower final Kendall tau.
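The counting step of this check could look roughly like the sketch below. It assumes the run logs every automated decision as a (winner, loser) pair and that the ground truth is available as a mapping from item to rank position; both data shapes and the function name are hypothetical.

```python
def audit_inferred(inferred_pairs, true_rank):
    """Count inferred outcomes that contradict the ground-truth order.

    inferred_pairs: list of (winner, loser) decisions made without a human.
    true_rank: dict mapping item -> position, where 0 is the best item.
    Returns (wrong_pairs, disagreement_rate).
    """
    wrong = [(w, l) for w, l in inferred_pairs if true_rank[w] > true_rank[l]]
    rate = len(wrong) / len(inferred_pairs) if inferred_pairs else 0.0
    return wrong, rate
```

A nonzero rate alone does not settle the question; the same run would also need its final Kendall's τ compared against a run in which the wrong pairs were answered correctly.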
Original abstract
Pairwise comparison is the gold standard for subjective ranking tasks; however, exhaustive annotation requires a massive number of human comparisons ($O(n^2)$). While sorting-based methods have reduced this burden to $O(n\log n)$, they still require expensive human judgment for every single comparison. To further improve annotation efficiency, we propose leveraging a Vision-Language Model (VLM) not as an annotator replacement, but as a \emph{question prioritizer} to identify which comparisons genuinely require human judgment. The proposed \textbf{Surprise-Guided MergeSort (SGS)} framework achieves this through three integrated components: (1) a bottom-up MergeSort scheduler that structures comparisons and exploits transitivity, (2) a composite Surprise Scorer -- combining position-bias-cancelled VLM confidence, Elo gap, and vote entropy -- to quantify comparison ambiguity, and (3) an adaptive budget allocator that routes high-surprise pairs to humans while automating low-surprise pairs via transitivity inference. Validation was conducted on six diverse benchmarks spanning text similarity (STS-B, BIOSSES, SICKR-STS) and image quality assessment (KonIQ-10k, TID2013, LIVE Challenge). SGS effectively identified and skipped up to 535 non-informative comparisons per session. Consequently, it achieved Kendall's $\tau{\times}100$ improvements of $+6$ to $+12$ over Active Elo under the same total budget. These results demonstrate that combining VLM-guided surprise metrics with algorithmic sorting provides a generally consistent accuracy-efficiency trade-off across diverse domains.
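The routing idea in the abstract can be illustrated with a bottom-up merge sort whose comparator is chosen per pair. This is a sketch under stated assumptions, not the paper's algorithm: `human_prefers`, `auto_prefers`, `surprise`, and `threshold` are hypothetical callbacks and parameters, and the automated branch here is a generic stand-in rather than the transitivity inference the paper describes.

```python
def bottom_up_merge_sort(items, human_prefers, auto_prefers, surprise, threshold):
    """Rank items best-first, routing each comparison by its surprise score.

    Returns (ranking, human_calls, auto_calls).
    """
    runs = [[x] for x in items]          # start from runs of length 1
    human_calls = auto_calls = 0
    while len(runs) > 1:
        merged_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 == len(runs):       # odd run out: carry it forward
                merged_runs.append(runs[k])
                continue
            left, right = runs[k], runs[k + 1]
            out, i, j = [], 0, 0
            while i < len(left) and j < len(right):
                a, b = left[i], right[j]
                if surprise(a, b) >= threshold:
                    human_calls += 1     # ambiguous pair: ask a human
                    a_wins = human_prefers(a, b)
                else:
                    auto_calls += 1      # clear pair: decide automatically
                    a_wins = auto_prefers(a, b)
                if a_wins:
                    out.append(a)
                    i += 1
                else:
                    out.append(b)
                    j += 1
            out.extend(left[i:])
            out.extend(right[j:])
            merged_runs.append(out)
        runs = merged_runs
    return (runs[0] if runs else []), human_calls, auto_calls
```

In a toy setting where items are numbers, higher is better, and only pairs differing by at most 1 count as surprising, the early merges are resolved automatically and human calls concentrate on near-ties. The total number of comparisons stays within the usual O(n log n) merge sort bound; only the split between human and automated calls changes.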
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Surprise-Guided MergeSort (SGS), which augments a bottom-up MergeSort scheduler with a composite Surprise Scorer (position-bias-cancelled VLM confidence + Elo gap + vote entropy) and an adaptive budget allocator. High-surprise pairs are routed to humans while low-surprise pairs are inferred via transitivity, with the goal of reducing human comparisons below the O(n log n) baseline. On six benchmarks (STS-B, BIOSSES, SICKR-STS for text; KonIQ-10k, TID2013, LIVE Challenge for images), SGS reports skipping up to 535 comparisons per session and Kendall's τ×100 gains of +6 to +12 versus Active Elo under identical total budget.
Significance. If the transitivity inferences prove reliable, the method offers a practical way to allocate limited human budget in subjective ranking by using VLMs only for prioritization. The evaluation across six diverse benchmarks and the explicit comparison against an independent Active Elo baseline are strengths; the algorithmic exploitation of MergeSort structure plus ML guidance is a clear contribution if the safety assumption holds.
major comments (2)
- [Abstract] The central claim that SGS achieves net τ gains by safely automating low-surprise pairs via transitivity is load-bearing, yet no direct metric (e.g., disagreement rate between inferred outcomes and held-out human labels) or propagation analysis through the bottom-up merge steps is reported.
- [Abstract] The reported +6 to +12 τ×100 improvements lack error bars, ablation of the three Surprise Scorer components, and any validation of how VLM confidence is computed, so it is impossible to determine whether the gains are robust or driven by post-hoc selection of automatable pairs.
minor comments (1)
- The manuscript would benefit from a table or figure that breaks down the number of skipped comparisons and the resulting τ per benchmark, including standard deviations across runs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger validation of the transitivity mechanism and robustness of the reported gains. We address each point below and will incorporate the suggested analyses in the revision.
Point-by-point responses
- Referee: [Abstract] The central claim that SGS achieves net τ gains by safely automating low-surprise pairs via transitivity is load-bearing, yet no direct metric (e.g., disagreement rate between inferred outcomes and held-out human labels) or propagation analysis through the bottom-up merge steps is reported.
  Authors: We agree that an explicit disagreement rate between transitivity-inferred outcomes and held-out human labels, along with propagation analysis across merge steps, would provide direct evidence for the safety of automated pairs. The current evaluation relies on end-to-end Kendall's τ under fixed budget as an implicit validation, since systematic inference errors would necessarily degrade final ranking quality relative to the Active Elo baseline. We will add both the disagreement metric (computed on a held-out human-labeled subset) and a step-wise propagation analysis in the revised manuscript.
  Revision: yes
- Referee: [Abstract] The reported +6 to +12 τ×100 improvements lack error bars, ablation of the three Surprise Scorer components, and any validation of how VLM confidence is computed, so it is impossible to determine whether the gains are robust or driven by post-hoc selection of automatable pairs.
  Authors: The reported gains are observed consistently across all six benchmarks, which provides some indication of robustness, but we acknowledge that the absence of error bars, component ablations, and explicit VLM confidence validation leaves open the possibility of post-hoc effects. We will add (i) error bars from multiple independent runs, (ii) ablations isolating each Surprise Scorer component (position-bias-cancelled VLM confidence, Elo gap, vote entropy), and (iii) details on the VLM confidence computation procedure to allow readers to assess whether gains are driven by the proposed scoring rather than selective automation.
  Revision: yes
Circularity Check
No significant circularity; empirical evaluation against independent baseline
Full rationale
The paper presents SGS as an empirical scheduling method that combines a standard bottom-up MergeSort, a composite Surprise Scorer (VLM confidence + Elo gap + vote entropy), and an adaptive allocator to skip comparisons via transitivity. Reported gains (+6 to +12 Kendall's τ×100 vs Active Elo) are measured on six external benchmarks under fixed budget, with no equations, fitted parameters, or self-citations shown to reduce the central performance claim to the inputs by construction. The transitivity assumption is algorithmic rather than redefined, and the evaluation uses an independent baseline without evidence of test-set fitting. This is a self-contained empirical result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Transitivity holds sufficiently often in the target ranking domains that low-surprise pairs can be inferred without error.
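What this assumption licenses can be shown with a generic reachability check: if A beat B and B beat C, then A over C is taken as settled without asking anyone. The sketch below is an illustration only; the function name and the (winner, loser) log format are hypothetical, and the paper's own inference is tied to the MergeSort structure rather than to an explicit graph search.

```python
from collections import defaultdict

def implied_by_transitivity(known, a, b):
    """True if 'a beats b' follows from chaining known (winner, loser) pairs."""
    beats = defaultdict(set)
    for winner, loser in known:
        beats[winner].add(loser)
    stack, seen = [a], {a}
    while stack:                          # depth-first search from a
        node = stack.pop()
        for nxt in beats.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False
```

If human judgments are intransitive for some triple, a check like this still returns True for the chained pair, which is exactly the failure mode the assumption rules out.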