REC-RL: Referring expression counting via Gaussian and range-based reward optimization
Pith reviewed 2026-05-20 19:47 UTC · model grok-4.3
The pith
REC-RL optimizes referring expression counting by rewarding range accuracy and Gaussian precision during intermediate reasoning steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
REC-RL shows that explicitly optimizing the reasoning process via a think-range-answer structure and combined range-based plus Gaussian rewards produces better performance in referring expression counting than methods relying only on final accuracy signals.
What carries the argument
The think-range-answer paradigm, which structures internal decision-making for focus prediction, powered by an accuracy reward that integrates range-based interval supervision with Gaussian-based precision guidance.
If this is right
- Performance improves consistently over rule-based reinforcement learning baselines on referring expression counting tasks.
- The model generalizes robustly across multiple benchmarks without task-specific retraining.
- Training proceeds without any extra annotations beyond standard image-expression pairs.
- Generated outputs follow more reliable structured formats due to the added format reward.
- Intermediate reasoning steps become more focused and aligned with typical human visual attention patterns.
Where Pith is reading between the lines
- The reward structure could transfer to other language-guided visual tasks such as referring expression comprehension or visual grounding.
- Gaussian precision terms might improve localization accuracy in related object detection settings.
- Varying the width of range intervals could be tested as a way to balance supervision strength.
- The overall approach suggests a path toward more interpretable step-by-step reasoning in larger vision-language models.
Load-bearing premise
Modeling intermediate focus prediction as internal decision-making via the think-range-answer paradigm produces better alignment with human perception and performance gains without requiring additional annotations.
What would settle it
Ablating the range-based interval and Gaussian precision components from the accuracy reward and measuring whether performance on standard referring expression counting benchmarks drops to or below baseline levels.
Original abstract
Referring expression counting (REC) is an intention-driven task that requires context-aware visual reasoning. While recent vision-language models incorporate language for visual understanding, most existing REC methods rely on rule-based reinforcement learning with rewards focused primarily on final accuracy, overlooking the quality of intermediate reasoning. We propose REC-RL, a reinforcement learning framework that introduces a think-range-answer paradigm to explicitly optimize the visual reasoning process. REC-RL employs Group Relative Policy Optimization and two lightweight rewards: an accuracy reward that combines range-based interval supervision with Gaussian-based precision guidance, and a format reward that enforces structured outputs. By modeling intermediate focus prediction as internal decision-making, REC-RL avoids additional annotations and better aligns with human perception. Extensive experiments demonstrate consistent improvements over strong baselines and robust generalization across benchmarks.
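The format reward mentioned in the abstract can be pictured as a structural check on the generated text. The sketch below is an illustration only: the tag names `<think>`, `<range>`, and `<answer>`, the `low-up` range syntax, and the binary 1.0/0.0 scoring are all assumptions, because the abstract does not give the literal output markup.

```python
import re

# Hypothetical markup: the abstract names the think-range-answer stages,
# but the tag names and the "low-up" range syntax are assumptions.
_STRUCTURE = re.compile(
    r"\s*<think>.+?</think>\s*"
    r"<range>\s*\d+\s*[-,]\s*\d+\s*</range>\s*"
    r"<answer>\s*\d+\s*</answer>\s*",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """Return 1.0 if all three stages appear once, in order; otherwise 0.0."""
    return 1.0 if _STRUCTURE.fullmatch(response) else 0.0


if __name__ == "__main__":
    ok = "<think>two pears are green</think><range>1-3</range><answer>2</answer>"
    print(format_reward(ok))                      # well-formed response
    print(format_reward("<answer>2</answer>"))    # missing think and range stages
```

Using `fullmatch` rather than `search` means that extra text before or after the three stages also fails the check, which is one plausible reading of "enforces structured outputs".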
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces REC-RL, a reinforcement learning framework for referring expression counting (REC). It proposes a think-range-answer paradigm to explicitly optimize intermediate visual reasoning steps. The method applies Group Relative Policy Optimization together with two lightweight rewards—an accuracy reward that fuses range-based interval supervision and Gaussian-based precision guidance, plus a format reward that enforces structured outputs. The central claim is that modeling focus prediction as internal decision-making yields consistent performance gains over strong baselines, robust generalization across benchmarks, and better alignment with human perception, all without requiring additional annotations.
Significance. If the reported gains hold under rigorous evaluation, the work offers a practical way to incorporate intermediate reasoning supervision into RL for vision-language models on counting tasks. The avoidance of extra annotations and the use of lightweight, combined rewards are clear strengths. The approach could influence subsequent research on RL-for-reasoning pipelines in multimodal settings, provided the experimental evidence demonstrates statistically reliable improvements rather than isolated gains.
major comments (2)
- [§4] §4 (Method), reward formulation: the accuracy reward is presented as combining range-based and Gaussian components, yet the manuscript does not specify whether the Gaussian variance or range thresholds are fixed a priori or tuned on the validation set; if the latter, the claim of lightweight, annotation-free supervision requires explicit confirmation that no task-specific hyperparameter search was performed.
- [Table 2] Table 2 (main results): while improvements over baselines are reported, the absence of standard deviations across multiple random seeds or statistical significance tests makes it difficult to judge whether the observed gains are robust or could be explained by training variance, which directly affects the central claim of consistent and generalizable improvements.
minor comments (2)
- [Abstract] The abstract states 'consistent improvements' and 'robust generalization' without any numerical values; adding at least the key metric deltas (e.g., +X% on RefCOCO) would improve readability.
- [§3] Notation for the think-range-answer steps is introduced in §3 but not consistently reused in the reward equations; aligning the variable names would reduce reader effort.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. The comments help clarify key aspects of our method and strengthen the empirical claims. We respond to each major comment below and indicate the revisions we will make.
-
Referee: [§4] §4 (Method), reward formulation: the accuracy reward is presented as combining range-based and Gaussian components, yet the manuscript does not specify whether the Gaussian variance or range thresholds are fixed a priori or tuned on the validation set; if the latter, the claim of lightweight, annotation-free supervision requires explicit confirmation that no task-specific hyperparameter search was performed.
Authors: We thank the referee for this observation. The Gaussian variance (set to 1.0) and range thresholds (e.g., intervals of width 5 for counting bins) are fixed a priori according to the typical scale of referring expression counts in the benchmarks; no validation-set tuning or task-specific hyperparameter search was performed. This choice keeps the reward lightweight and annotation-free. We will revise §4 to state these fixed values and the rationale explicitly. revision: yes
-
Referee: [Table 2] Table 2 (main results): while improvements over baselines are reported, the absence of standard deviations across multiple random seeds or statistical significance tests makes it difficult to judge whether the observed gains are robust or could be explained by training variance, which directly affects the central claim of consistent and generalizable improvements.
Authors: We agree that variability measures and significance testing would better support the robustness claims. In the revised manuscript we will report means and standard deviations over multiple random seeds (at least three independent runs) for the main results in Table 2 and will add paired statistical significance tests against the baselines. These additions will directly address concerns about training variance. revision: yes
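The reporting promised in this response (mean and standard deviation over seeds, plus a paired test against a baseline) could be computed along the following lines. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the choice of a paired t-statistic is an assumption, since the rebuttal does not say which paired test will be used.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation of one metric across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(method, baseline):
    """Paired t-statistic for per-seed scores of a method versus a baseline.

    Both lists must be aligned by seed. The result has len(method) - 1
    degrees of freedom.
    """
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two aligned lists with at least two seeds")
    diffs = [m - b for m, b in zip(method, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all per-seed differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Arbitrary placeholder values, not results from the paper.
    print(mean_std([1.0, 2.0, 3.0]))
    print(paired_t_statistic([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```

With only three runs the test has two degrees of freedom, so the two-sided 5% critical value of the t-distribution is about 4.30; a gain has to be large relative to seed-to-seed noise to clear it.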
Circularity Check
No significant circularity; empirical RL framework with independent experimental validation
full rationale
The paper presents REC-RL as an empirical reinforcement learning approach for referring expression counting, introducing a think-range-answer paradigm, Group Relative Policy Optimization, and composite rewards (range-based interval supervision combined with Gaussian precision guidance, plus a format reward). No equations, derivations, or first-principles predictions are described that reduce claimed performance gains to quantities defined by the same fitted parameters or self-referential inputs. The central claims rest on experimental outcomes across benchmarks rather than any closed logical loop, self-definitional reward construction, or load-bearing self-citation chain. The argument is self-contained as a standard proposal whose validity is externally falsifiable via replication on held-out data.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear · Relation between the paper passage and the cited Recognition theorem.
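Gaussian Function... $F(y, N_{gt}) = \exp\left(-k \cdot \left(\frac{N_{pred} - N_{gt}}{\max(N_{gt}, \varepsilon)}\right)^{2}\right)$ ... $k = 20$ ... range-based reward $r_{range} = I_{valid} \cdot \frac{1}{2}\left[ I(Low, N_{gt}) \cdot F(Low, N_{gt}) + I(Up, N_{gt}) \cdot F(Up, N_{gt}) \right]$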
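The quoted formulas can be turned into a small numerical sketch. Only the Gaussian form and k = 20 come from the passage. The meanings given to the indicator terms below (a valid range has 0 <= Low <= Up, the Low term counts only when Low <= N_gt, the Up term only when Up >= N_gt) and the value of epsilon are assumptions, since the excerpt elides their definitions.

```python
import math


def gaussian_score(value: float, n_gt: float, k: float = 20.0, eps: float = 1e-6) -> float:
    """F(value, n_gt) = exp(-k * ((value - n_gt) / max(n_gt, eps))**2)."""
    rel_err = (value - n_gt) / max(n_gt, eps)
    return math.exp(-k * rel_err ** 2)


def range_reward(low: float, up: float, n_gt: float, k: float = 20.0, eps: float = 1e-6) -> float:
    """Range-based reward following the quoted formula.

    ASSUMPTION: the indicator definitions are guesses, not from the paper.
    """
    i_valid = 1.0 if 0 <= low <= up else 0.0   # assumed: well-formed range
    i_low = 1.0 if low <= n_gt else 0.0        # assumed: lower bound not above the truth
    i_up = 1.0 if up >= n_gt else 0.0          # assumed: upper bound not below the truth
    return i_valid * 0.5 * (
        i_low * gaussian_score(low, n_gt, k, eps)
        + i_up * gaussian_score(up, n_gt, k, eps)
    )


if __name__ == "__main__":
    print(gaussian_score(5, 5))    # exact prediction scores 1.0
    print(gaussian_score(4, 5))    # off by one on a count of five
    print(range_reward(4, 6, 5))   # range bracketing the ground truth
    print(range_reward(6, 4, 5))   # inverted range, rejected by i_valid
```

Because the error is divided by the ground-truth count, the score depends on relative error: an off-by-one prediction is penalized more heavily for small counts than for large ones.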
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
unclear · Relation between the paper passage and the cited Recognition theorem.
think–range–answer paradigm ... GRPO ... accuracy reward $r_{acc} = r_{ans} + \alpha \cdot r_{range}$
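The combination rule in this passage is a weighted sum, sketched below. The value of alpha and the way the answer reward is computed are not given in the excerpt, so both are left as inputs; the function name is hypothetical.

```python
def accuracy_reward(r_ans: float, r_range: float, alpha: float) -> float:
    """r_acc = r_ans + alpha * r_range, as quoted in the passage.

    alpha is a free weight; the excerpt does not state its value.
    """
    return r_ans + alpha * r_range


if __name__ == "__main__":
    # Illustrative inputs only: a perfect answer, a half-credit range, alpha = 0.5.
    print(accuracy_reward(1.0, 0.5, alpha=0.5))
```

If both component rewards lie in [0, 1], the combined reward lies in [0, 1 + alpha], so alpha directly sets how much the intermediate range prediction can contribute relative to the final answer.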
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Referring Expression Counting (REC) is a fine-grained computer vision task that aims to quantify objects specified by both category and contextual attributes [1]. Unlike conventional class-level counting, REC requires understanding compositional queries such as “green pears on the table” within a broader category like “fruit,” where att...
-
[2]
REC-RL: Referring expression counting via Gaussian and range-based reward optimization
and VLM-R1 [8] successfully adapt this paradigm to vision-language models, where rule-based reward functions provide reliable outcome supervision [9]. Collectively, these studies suggest that RL is particularly effective for deterministic tasks like REC, as it yields stable and interpretable training signals. Despite this progress, existing REC methods ...
-
[3]
METHOD REC-RL proposes a novel framework tailored for referring expression counting via Gaussian and Range-Based Reward Optimization. As illustrated in Fig. 1, for a given question $q$, the GRPO algorithm first samples $N$ candidate responses $\{o_1, o_2, \cdots, o_N\}$ from the policy model $\theta_{old}$, where each response is structured following the think–range–answer ...
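The group sampling step described in this excerpt feeds the group-relative advantage that gives GRPO its name: each response's reward is normalized against the other responses drawn for the same question. A minimal sketch follows, assuming the common mean and standard deviation normalization; the use of the population standard deviation and the stabilizing `eps` are assumptions, and the excerpt does not show the authors' exact variant.

```python
import statistics


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward against its own sampled group.

    A_i = (r_i - mean(r)) / (std(r) + eps), computed per question.
    ASSUMPTION: population std and eps; implementations differ on both.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative rewards for N = 4 sampled responses to one question.
    print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
    # Identical rewards carry no learning signal: all advantages are zero.
    print(group_relative_advantages([0.5, 0.5, 0.5]))
```

The second call shows why a graded reward matters here: if every sampled response receives the same score, the group provides no gradient signal, whereas a smooth reward tends to separate near-misses from far-misses.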
-
[4]
EXPERIMENTS 3.1. Dataset We evaluate our method on the REC-8K dataset [1], which contains 8,011 images annotated with referring expression–count pairs. The dataset is split into 4,923 images (10,555 pairs) for training, 1,566 images (3,336 pairs) for validation, and 1,522 images (3,231 pairs) for testing. 3.2. Implementation Details All experiments are ...
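The split sizes quoted in this excerpt can be cross-checked with a few lines of arithmetic. The numbers below are copied from the excerpt; the dictionary layout and variable names are illustrative.

```python
# (images, expression-count pairs) per split, as quoted in the excerpt.
splits = {
    "train": (4923, 10555),
    "val": (1566, 3336),
    "test": (1522, 3231),
}

total_images = sum(images for images, _ in splits.values())
total_pairs = sum(pairs for _, pairs in splits.values())

if __name__ == "__main__":
    print(total_images)                          # should equal the stated 8,011 images
    print(total_pairs)                           # total pairs across all splits
    print(round(total_pairs / total_images, 2))  # average pairs per image
```

The image counts sum to the stated 8,011, and the pair counts sum to 17,122, which works out to roughly 2.1 referring expressions per image.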
-
[5]
CONCLUSION In this work, we reevaluate the prevailing R1-like training framework for referring expression counting (REC) through the lenses of structural reasoning and non-linear reward shaping. First, we introduce the think-range-answer paradigm, which reframes REC from a direct mapping task to a structured decision-making process. By treating range pr...
-
[6]
Referring expression counting
S. Dai, J. Liu, and N.-M. Cheung, “Referring expression counting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2024, pp. 16985–16995
-
[7]
Flamingo: a visual language model for few-shot learning
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., “Flamingo: a visual language model for few-shot learning,” in Advances in Neural Information Processing Systems, 2022, vol. 35, pp. 23716–23736
-
[8]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” in arXiv preprint arXiv:2501.12948, 2025
-
[9]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y.K. Li, Y. Wu, et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” in arXiv preprint arXiv:2402.03300, 2024
-
[10]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, et al., “DAPO: An open-source LLM reinforcement learning system at scale,” in arXiv preprint arXiv:2503.14476, 2025
-
[11]
Chain-of-thought prompting elicits reasoning in large language models
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q.V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, 2022, vol. 35, pp. 24824–24837
-
[12]
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Y. Yang, X. He, H. Pan, X. Jiang, Y. Deng, X. Yang, H. Lu, D. Yin, F. Rao, M. Zhu, et al., “R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization,” in arXiv preprint arXiv:2503.10615, 2025
-
[13]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
H. Shen, P. Liu, J. Li, C. Fang, Y. Ma, J. Liao, Q. Shen, Z. Zhang, K. Zhao, Q. Zhang, et al., “VLM-R1: A stable and generalizable R1-style large vision-language model,” in arXiv preprint arXiv:2504.07615, 2025
-
[14]
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Y. Deng, H. Bansal, F. Yin, N. Peng, W. Wang, and K.-W. Chang, “OpenVLThinker: An early exploration to complex vision-language reasoning via iterative self-improvement,” in arXiv preprint arXiv:2503.17352, 2025
-
[15]
Visual-RFT: Visual reinforcement fine-tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang, “Visual-RFT: Visual reinforcement fine-tuning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 2034–2044
-
[16]
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” in The Twelfth International Conference on Learning Representations, 2023
-
[17]
Single-image crowd counting via multi-column convolutional neural network
Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016, pp. 589–597
-
[18]
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al., “Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection,” in European Conference on Computer Vision. Springer, 2024, pp. 38–55
-
[19]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al., “Qwen2.5-VL technical report,” in arXiv preprint arXiv:2502.13923, 2025
-
[20]
Z. Wang, P. Feng, Y. Lin, S. Cai, Z. Bian, J. Yan, and X. Zhu, “CrowdVLM-R1: Expanding R1 ability to vision language model for crowd counting using fuzzy group relative policy reward,” in arXiv preprint arXiv:2504.03724, 2025
-
[21]
InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al., “InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2024, pp. 24185–24198
-
[22]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F.L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “GPT-4 technical report,” in arXiv preprint arXiv:2303.08774, 2023
-
[23]
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., “The Llama 3 herd of models,” in arXiv e-prints, 2024, pp. arXiv–2407