
arxiv: 2606.15623 · v4 · pith:ARXE3OKTnew · submitted 2026-06-14 · 💻 cs.LG · cs.AI

Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling

Pith reviewed 2026-06-30 11:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords human-in-the-loop ranking · pairwise comparison · MergeSort · vision-language model · active learning · Kendall tau · budget-efficient annotation · transitivity inference

The pith

Surprise-Guided MergeSort skips up to 535 human comparisons per session while raising Kendall's τ×100 by 6 to 12 points over Active Elo under the same total budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Surprise-Guided MergeSort to cut the cost of pairwise ranking by deciding which comparisons truly need human input. It builds comparisons bottom-up with MergeSort to exploit transitivity, then uses a composite scorer of VLM confidence, Elo gap, and vote entropy to flag ambiguous pairs. Low-surprise pairs are inferred automatically instead of shown to humans. Across six text and image benchmarks this produced consistent gains in ranking quality for the same total human budget.

Core claim

Surprise-Guided MergeSort integrates a bottom-up MergeSort scheduler, a composite Surprise Scorer, and an adaptive budget allocator that sends only high-surprise pairs to humans and automates the rest via transitivity inference, yielding higher Kendall tau under fixed annotation budgets on STS-B, BIOSSES, SICKR-STS, KonIQ-10k, TID2013, and LIVE Challenge.
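The three components can be sketched together as a bottom-up merge sort whose comparator routes each pair either to a human or to an automated decision. This is a minimal illustration, not the authors' implementation: `surprise`, `human_cmp`, `auto_cmp`, `threshold`, and `budget` are hypothetical names, the automated comparator stands in for the paper's transitivity inference (which the abstract does not specify), and falling back to automation once the budget runs out is an assumption.

```python
from typing import Callable, List, Sequence, Tuple


def sgs_sketch(
    items: Sequence,
    surprise: Callable[[object, object], float],   # hypothetical ambiguity score
    human_cmp: Callable[[object, object], bool],   # True if a ranks before b (human)
    auto_cmp: Callable[[object, object], bool],    # True if a ranks before b (automated)
    threshold: float,                              # assumed routing cutoff
    budget: int,                                   # assumed cap on human comparisons
) -> Tuple[List, int, int]:
    """Bottom-up merge sort; returns (ranking, human_used, automated)."""
    runs = [[x] for x in items]
    used = 0
    automated = 0

    def before(a, b) -> bool:
        nonlocal used, automated
        # High-surprise pairs go to the human while budget remains.
        if used < budget and surprise(a, b) >= threshold:
            used += 1
            return human_cmp(a, b)
        # Everything else is decided without a human.
        automated += 1
        return auto_cmp(a, b)

    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 == len(runs):
                merged.append(runs[i])  # odd run out is carried to the next pass
                continue
            left, right = runs[i], runs[i + 1]
            out, li, ri = [], 0, 0
            while li < len(left) and ri < len(right):
                if before(right[ri], left[li]):
                    out.append(right[ri])
                    ri += 1
                else:
                    out.append(left[li])
                    li += 1
            out.extend(left[li:])
            out.extend(right[ri:])
            merged.append(out)
        runs = merged
    return (runs[0] if runs else []), used, automated


# Toy usage: numbers stand in for items, and close values count as "surprising".
ranking, used, automated = sgs_sketch(
    [5, 3, 8, 1, 9, 2],
    surprise=lambda a, b: 1.0 if abs(a - b) <= 2 else 0.0,
    human_cmp=lambda a, b: a < b,
    auto_cmp=lambda a, b: a < b,
    threshold=0.5,
    budget=3,
)
print(ranking, used, automated)
```

In this toy setup both comparators are always correct, so the ranking comes out sorted regardless of routing. The interesting case is when `auto_cmp` can be wrong, which is exactly where the routing threshold matters.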

What carries the argument

The composite Surprise Scorer, which combines position-bias-cancelled VLM confidence, Elo gap, and vote entropy to measure comparison ambiguity and decide human versus automated routing.
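To make the scorer concrete, here is one way the three signals could be combined. The abstract names the signals but not the combination rule, so everything below is an assumption: turning each signal into a binary entropy, the equal weights, and the function names are illustrative only. The Elo expected-score formula is the standard one.

```python
import math


def debiased_confidence(p_a_first: float, p_b_first: float) -> float:
    """Average the model's preference for A over both presentation orders.

    p_a_first: probability that A wins when A is shown first.
    p_b_first: probability that B wins when B is shown first.
    A model that always favours the first slot cancels out to 0.5.
    """
    return 0.5 * (p_a_first + (1.0 - p_b_first))


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome distribution; 1.0 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def elo_expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def surprise(p_a_first, p_b_first, r_a, r_b, votes_a, votes_b,
             weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical composite score in [0, 1]; higher means more ambiguous."""
    vlm_term = binary_entropy(debiased_confidence(p_a_first, p_b_first))
    elo_term = binary_entropy(elo_expected(r_a, r_b))  # small Elo gap -> near 1
    total = votes_a + votes_b
    vote_term = binary_entropy(votes_a / total) if total else 1.0
    w_vlm, w_elo, w_vote = weights
    return w_vlm * vlm_term + w_elo * elo_term + w_vote * vote_term


ambiguous = surprise(0.5, 0.5, 1500, 1500, 1, 1)    # every signal undecided
clear_cut = surprise(0.99, 0.01, 1800, 1200, 5, 0)  # every signal agrees on A
print(ambiguous > clear_cut)
```

Under this sketch, a pair would be sent to a human when its score exceeds some cutoff; the cutoff and the weights are the natural targets for the ablations the referee asks for.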

If this is right

  • Up to 535 non-informative comparisons can be skipped per session without human input.
  • Kendall's τ×100 improves by +6 to +12 compared to Active Elo under the same total budget.
  • The accuracy-efficiency trade-off holds across both text similarity and image quality assessment domains.
  • VLM-guided surprise metrics plus sorting structure outperform prior active comparison schedulers on the tested benchmarks.
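For readers unfamiliar with the metric, the sketch below computes Kendall's τ (the tau-a variant, assuming no ties) and scales it by 100. Whether the paper uses tau-a or a tie-corrected variant is not stated in the abstract; the function name and toy scores are illustrative.

```python
from itertools import combinations


def kendall_tau(true_scores, predicted_scores) -> float:
    """Kendall tau-a for two equal-length score lists without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(true_scores)), 2):
        sign = (true_scores[i] - true_scores[j]) * (
            predicted_scores[i] - predicted_scores[j]
        )
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(true_scores) * (len(true_scores) - 1) // 2
    return (concordant - discordant) / n_pairs


truth = [1, 2, 3, 4, 5]
one_swap = [1, 2, 3, 5, 4]   # 1 of 10 pairs is out of order
two_swaps = [2, 1, 3, 5, 4]  # 2 of 10 pairs are out of order

print(round(100 * kendall_tau(truth, one_swap), 1))   # 80.0
print(round(100 * kendall_tau(truth, two_swaps), 1))  # 60.0
```

Because τ = 1 − 2D/N for D discordant pairs out of N (absent ties), a gain of +6 in τ×100 corresponds to 3% of all item pairs moving from discordant to concordant, and +12 corresponds to 6%.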

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same surprise-based routing could be tested on preference data collection for language model alignment.
  • Replacing the VLM component with a domain-specific model might extend the approach beyond vision-language tasks.
  • Scaling experiments on datasets larger than the current six benchmarks would show whether skipped-comparison counts grow with the number of items n.

Load-bearing premise

The composite Surprise Scorer reliably identifies comparisons whose outcome can be safely inferred by transitivity without introducing ranking errors.

What would settle it

Run SGS to completion on a dataset with complete ground-truth rankings, then count how many inferred comparisons disagree with the true order and whether those disagreements lower final Kendall tau.
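The counting step of that check is simple enough to sketch. The code assumes the run logs every automated decision as a `(winner, loser)` pair and that a ground-truth rank is available for every item; both the log format and the function name are hypothetical.

```python
def audit_inferred(inferred, true_rank):
    """Count automated decisions that contradict the ground-truth order.

    inferred:  list of (winner, loser) pairs decided without a human.
    true_rank: dict mapping item -> ground-truth position (0 = best).
    Returns (list of wrong pairs, disagreement rate).
    """
    wrong = [(w, l) for w, l in inferred if true_rank[w] > true_rank[l]]
    rate = len(wrong) / len(inferred) if inferred else 0.0
    return wrong, rate


true_rank = {"a": 0, "b": 1, "c": 2, "d": 3}
inferred = [("a", "b"), ("c", "b"), ("a", "d"), ("c", "d")]

wrong, rate = audit_inferred(inferred, true_rank)
print(wrong, rate)  # [('c', 'b')] 0.25
```

The second half of the check, whether those disagreements lower the final Kendall tau, would require re-running the sort with the wrong pairs corrected and comparing the two rankings.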

Original abstract

Pairwise comparison is the gold standard for subjective ranking tasks; however, exhaustive annotation requires a massive number of human comparisons (O(n²)). While sorting-based methods have reduced this burden to O(n log n), they still require expensive human judgment for every single comparison. To further improve annotation efficiency, we propose leveraging a Vision-Language Model (VLM) not as an annotator replacement, but as a question prioritizer to identify which comparisons genuinely require human judgment. The proposed Surprise-Guided MergeSort (SGS) framework achieves this through three integrated components: (1) a bottom-up MergeSort scheduler that structures comparisons and exploits transitivity, (2) a composite Surprise Scorer (combining position-bias-cancelled VLM confidence, Elo gap, and vote entropy) to quantify comparison ambiguity, and (3) an adaptive budget allocator that routes high-surprise pairs to humans while automating low-surprise pairs via transitivity inference. Validation was conducted on six diverse benchmarks spanning text similarity (STS-B, BIOSSES, SICKR-STS) and image quality assessment (KonIQ-10k, TID2013, LIVE Challenge). SGS effectively identified and skipped up to 535 non-informative comparisons per session. Consequently, it achieved Kendall's τ×100 improvements of +6 to +12 over Active Elo under the same total budget. These results demonstrate that combining VLM-guided surprise metrics with algorithmic sorting provides a generally consistent accuracy-efficiency trade-off across diverse domains.
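To give a feel for the two complexity classes the abstract contrasts, the snippet below compares the exact number of exhaustive pairs, n(n−1)/2, with the n·log₂(n) scale of sorting-based schedules. The second figure is an order-of-magnitude guide rather than an exact comparison count, and the item counts are arbitrary examples, not the benchmark sizes used in the paper.

```python
import math


def exhaustive_pairs(n: int) -> int:
    """Every unordered pair compared once: n(n-1)/2."""
    return n * (n - 1) // 2


def sorting_scale(n: int) -> float:
    """Rough n*log2(n) scale of a comparison sort (not an exact count)."""
    return n * math.log2(n)


for n in (100, 1000):
    print(n, exhaustive_pairs(n), round(sorting_scale(n)))
# 100 4950 664
# 1000 499500 9966
```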

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Surprise-Guided MergeSort (SGS), which augments a bottom-up MergeSort scheduler with a composite Surprise Scorer (position-bias-cancelled VLM confidence + Elo gap + vote entropy) and an adaptive budget allocator. High-surprise pairs are routed to humans while low-surprise pairs are inferred via transitivity, with the goal of reducing human comparisons below the O(n log n) baseline. On six benchmarks (STS-B, BIOSSES, SICKR-STS for text; KonIQ-10k, TID2013, LIVE Challenge for images), SGS reports skipping up to 535 comparisons per session and Kendall's τ×100 gains of +6 to +12 versus Active Elo under identical total budget.

Significance. If the transitivity inferences prove reliable, the method offers a practical way to allocate limited human budget in subjective ranking by using VLMs only for prioritization. The evaluation across six diverse benchmarks and the explicit comparison against an independent Active Elo baseline are strengths; the algorithmic exploitation of MergeSort structure plus ML guidance is a clear contribution if the safety assumption holds.

major comments (2)
  1. [Abstract] The central claim that SGS achieves net τ gains by safely automating low-surprise pairs via transitivity is load-bearing, yet no direct metric (e.g., disagreement rate between inferred outcomes and held-out human labels) or propagation analysis through the bottom-up merge steps is reported.
  2. [Abstract] The reported +6 to +12 τ×100 improvements lack error bars, ablation of the three Surprise Scorer components, and any validation of how VLM confidence is computed, so it is impossible to determine whether the gains are robust or driven by post-hoc selection of automatable pairs.
minor comments (1)
  1. The manuscript would benefit from a table or figure that breaks down the number of skipped comparisons and the resulting τ per benchmark, including standard deviations across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger validation of the transitivity mechanism and robustness of the reported gains. We address each point below and will incorporate the suggested analyses in the revision.

  1. Referee: [Abstract] The central claim that SGS achieves net τ gains by safely automating low-surprise pairs via transitivity is load-bearing, yet no direct metric (e.g., disagreement rate between inferred outcomes and held-out human labels) or propagation analysis through the bottom-up merge steps is reported.

    Authors: We agree that an explicit disagreement rate between transitivity-inferred outcomes and held-out human labels, along with propagation analysis across merge steps, would provide direct evidence for the safety of automated pairs. The current evaluation relies on end-to-end Kendall's τ under fixed budget as an implicit validation, since systematic inference errors would necessarily degrade final ranking quality relative to the Active Elo baseline. We will add both the disagreement metric (computed on a held-out human-labeled subset) and a step-wise propagation analysis in the revised manuscript.
    Revision: yes

  2. Referee: [Abstract] The reported +6 to +12 τ×100 improvements lack error bars, ablation of the three Surprise Scorer components, and any validation of how VLM confidence is computed, so it is impossible to determine whether the gains are robust or driven by post-hoc selection of automatable pairs.

    Authors: The reported gains are observed consistently across all six benchmarks, which provides some indication of robustness, but we acknowledge that the absence of error bars, component ablations, and explicit VLM confidence validation leaves open the possibility of post-hoc effects. We will add (i) error bars from multiple independent runs, (ii) ablations isolating each Surprise Scorer component (position-bias-cancelled VLM confidence, Elo gap, vote entropy), and (iii) details on the VLM confidence computation procedure to allow readers to assess whether gains are driven by the proposed scoring rather than selective automation.
    Revision: yes
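The error bars requested in the second point amount to reporting a mean and spread over repeated runs rather than a single number. A minimal sketch follows; the τ×100 values are made-up placeholders chosen only to exercise the code, not results from the paper, and pairing runs by seed is an assumption about how such an experiment would be set up.

```python
from statistics import mean, stdev


def paired_gain(method_runs, baseline_runs):
    """Mean and sample standard deviation of per-run gains.

    Each position is assumed to be one independent run (same seed and
    budget) of the method and of the baseline.
    """
    diffs = [m - b for m, b in zip(method_runs, baseline_runs)]
    return mean(diffs), stdev(diffs)


# Placeholder tau*100 values, NOT taken from the paper.
method_tau = [70.0, 72.0, 74.0]
baseline_tau = [63.0, 64.0, 62.0]

gain_mean, gain_sd = paired_gain(method_tau, baseline_tau)
print(f"gain = {gain_mean:.1f} +/- {gain_sd:.1f} (tau x 100)")
```

A gain whose spread is comparable to its mean would be much weaker evidence than the bare "+6 to +12" range suggests, which is why the referee treats the missing error bars as a major point.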

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation against independent baseline


The paper presents SGS as an empirical scheduling method that combines a standard bottom-up MergeSort, a composite Surprise Scorer (VLM confidence + Elo gap + vote entropy), and an adaptive allocator to skip comparisons via transitivity. Reported gains (+6 to +12 Kendall's τ×100 vs Active Elo) are measured on six external benchmarks under fixed budget, with no equations, fitted parameters, or self-citations shown to reduce the central performance claim to the inputs by construction. The transitivity assumption is algorithmic rather than redefined, and the evaluation uses an independent baseline without evidence of test-set fitting. This is a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the method rests on the domain assumption that transitivity can be safely applied to low-surprise pairs and that the VLM provides a useful signal for ambiguity. No free parameters or invented entities are explicitly quantified in the provided text.

axioms (1)
  • Domain assumption: transitivity holds sufficiently often in the target ranking domains that low-surprise pairs can be inferred without error.
    Invoked by the adaptive budget allocator to skip human comparisons.
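What the axiom buys can be shown with a small sketch of transitivity inference as graph reachability: if A beat B and B beat C, then A over C is inferred without asking anyone. The abstract does not describe how SGS implements this, so the data structure and function name below are assumptions.

```python
from collections import defaultdict


def infer_by_transitivity(known, a, b):
    """Decide a vs b from known outcomes, if transitivity allows it.

    known: iterable of (winner, loser) pairs already decided.
    Returns True if a beats b by a chain of known outcomes,
    False if b beats a, and None if neither direction is implied.
    """
    beats = defaultdict(set)
    for winner, loser in known:
        beats[winner].add(loser)

    def reaches(src, dst) -> bool:
        # Depth-first search along "beats" edges.
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            for nxt in beats[node]:
                if nxt == dst:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    if reaches(a, b):
        return True
    if reaches(b, a):
        return False
    return None


known = [("a", "b"), ("b", "c")]
print(infer_by_transitivity(known, "a", "c"))  # True
print(infer_by_transitivity(known, "c", "a"))  # False
print(infer_by_transitivity(known, "a", "d"))  # None
```

The sketch also exposes the failure mode: if human judgments form a cycle (A over B, B over C, C over A), both directions are reachable and the function simply returns whichever it checks first. Subjective rankings can contain such cycles, which is why the assumption is listed as load-bearing.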


