Algorithmic Learning in a Random World
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
citing papers explorer
- VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
  VLM judges exhibit task-dependent uncertainty in their scores, with conformal prediction revealing wide intervals for complex tasks and a decoupling between good ranking performance and poor absolute scoring reliability.
- Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning
  Conformal prediction coverage collapses before accuracy during lifelong LLM fine-tuning, and a lightweight calibration replay using small task buffers can restore nominal coverage.