Visual-RFT: Visual Reinforcement Fine-Tuning
Pith reviewed 2026-05-13 22:11 UTC · model grok-4.3
The pith
Visual-RFT fine-tunes large vision-language models on visual tasks using verifiable perceptual rewards computed from a few labeled examples, rather than supervised fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Visual-RFT first uses a Large Vision-Language Model to generate multiple responses for each input, each containing reasoning tokens and a final answer. It then scores those responses with the proposed visual perception verifiable reward functions and updates the model with a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). Different verifiable reward functions are designed for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. The paper reports competitive performance and stronger generalization than supervised fine-tuning on fine-grained image classification, few-shot object detection, reasoning grounding, and open-vocabulary object detection benchmarks.
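The group-relative step can be illustrated with a minimal sketch. It assumes the standard GRPO normalization, in which each response's reward is centered by its group's mean and divided by the group's standard deviation. The function name, the `eps` guard, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response for a single input.
    `eps` (an assumption of this sketch) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one image, scored by some verifiable reward.
advantages = group_relative_advantages([0.9, 0.1, 0.5, 0.5])
```

Responses that score above their group's mean receive positive advantages. A group whose responses all receive the same reward yields zero advantages and therefore no learning signal for that input.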
What carries the argument
Visual perception verifiable reward functions, such as Intersection over Union for object detection, paired with Group Relative Policy Optimization to update the policy from multiple generated responses.
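The IoU reward named here can be sketched as follows. This is a simplified illustration, assuming axis-aligned boxes in `(x1, y1, x2, y2)` corner format and a single predicted box matched to a single ground-truth box. The function name and box format are assumptions, and the paper's full detection reward may include terms not shown.

```python
def iou_reward(pred_box, gt_box):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred_box
    gx1, gy1, gx2, gy2 = gt_box

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h

    # Degenerate boxes (non-positive width or height) contribute zero area.
    area_p = max(0.0, px2 - px1) * max(0.0, py2 - py1)
    area_g = max(0.0, gx2 - gx1) * max(0.0, gy2 - gy1)
    union = area_p + area_g - inter

    return inter / union if union > 0 else 0.0
```

Unlike a binary correct/incorrect signal, this reward varies continuously with how well the predicted box overlaps the target, which is what makes it a candidate for a denser training signal on localization tasks.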
If this is right
- Delivers a 24.3 percent accuracy increase over baseline in one-shot fine-grained image classification using around 100 samples.
- Exceeds the baseline by 21.9 points on COCO two-shot object detection and by 15.4 points on LVIS.
- Improves results on reasoning grounding and open-vocabulary object detection relative to supervised baselines.
- Offers a data-efficient alternative to supervised fine-tuning for domain-specific adaptation of vision-language models.
Where Pith is reading between the lines
- The same reward-driven approach could transfer to additional visual tasks such as segmentation if equivalent quantifiable metrics are available.
- Future work could combine visual rewards with language-based rewards to strengthen cross-modal reasoning chains.
- Gains may compound if base large vision-language models improve at generating diverse initial responses before optimization begins.
Load-bearing premise
That visual perception reward functions like IoU supply sufficiently dense and unbiased signals to guide effective policy optimization on visual tasks.
What would settle it
Apply Visual-RFT to a new visual task lacking a clear quantitative reward metric, such as subjective image quality assessment, and check whether accuracy fails to exceed that of supervised fine-tuning on the same limited data.
Original abstract
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Visual-RFT, extending reinforcement fine-tuning (RFT) with verifiable rewards to visual perception tasks in LVLMs. It generates multiple reasoning+answer responses per input via the base model, applies task-specific visual rewards (exemplified by IoU for detection), and optimizes the policy with GRPO. The experiments report headline gains over the baseline on one-shot fine-grained classification (+24.3% with ~100 samples), COCO 2-shot detection (+21.9), and LVIS (+15.4), along with competitive results relative to SFT on reasoning grounding and open-vocabulary detection, positioning the method as a data-efficient, reward-driven alternative to supervised fine-tuning.
Significance. If the reported gains are shown to arise specifically from dense visual-perception rewards rather than from multi-sample generation or GRPO regularization alone, the work would demonstrate a practical route to reward-driven adaptation of LVLMs in data-scarce regimes. The multi-task coverage and direct SFT comparisons are positive; however, the absence of reward-function pseudocode, training-budget controls, and significance tests limits the strength of the central claim that visual rewards are the key differentiator.
major comments (3)
- [Abstract / §3] Abstract and §3 (reward design): the manuscript exemplifies only the IoU reward for detection; the exact functional form of the 'visual perception verifiable reward' for fine-grained classification is never stated. If this reward reduces to exact string match on the final answer token (or LVLM-judged correctness), it supplies no additional visual signal beyond the cross-entropy loss already used in the SFT baseline on the same ~100 samples, undermining the premise that visual rewards drive the 24.3% lift.
- [§4] §4 (experiments): no training-budget table, no wall-clock or token counts for the GRPO runs versus the SFT baselines, and no statistical significance tests (e.g., standard error over multiple seeds) are reported for the headline numbers (+24.3% classification, +21.9 COCO 2-shot). Without these controls it is impossible to rule out that the observed differences arise from longer effective optimization or variance rather than the visual reward.
- [§3.2] §3.2 (GRPO formulation): the paper adopts the standard GRPO objective without modification. The manuscript must isolate whether any performance increment survives when the visual reward is replaced by a non-visual answer-match reward; otherwise the central claim that 'visual perception verifiable reward functions' are the operative ingredient remains untested.
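The seed-level reporting requested in the second major comment could be summarized with a helper like the one below. This is a minimal sketch: the function name is hypothetical, and the scores in the usage line are placeholder values chosen for illustration, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for per-seed scores.

    Uses the sample standard deviation, so at least two seeds are required.
    """
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Placeholder per-seed scores, not values from the paper.
avg, sem = mean_and_standard_error([1.0, 2.0, 3.0])
```

Reporting each headline number as mean plus or minus standard error for both Visual-RFT and SFT would let a reader judge whether a gap exceeds seed-to-seed variation.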
minor comments (2)
- [§3] Notation for the reward functions is introduced only by example; a single compact equation or pseudocode block listing r_class, r_det, r_grounding would improve reproducibility.
- [§4] Figure captions and axis labels in the few-shot detection plots omit the exact number of training samples per class; this information is only recoverable from the text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and will revise the paper to strengthen the presentation and claims.
-
Referee: [Abstract / §3] Abstract and §3 (reward design): the manuscript exemplifies only the IoU reward for detection; the exact functional form of the 'visual perception verifiable reward' for fine-grained classification is never stated. If this reward reduces to exact string match on the final answer token (or LVLM-judged correctness), it supplies no additional visual signal beyond the cross-entropy loss already used in the SFT baseline on the same ~100 samples, undermining the premise that visual rewards drive the 24.3% lift.
Authors: We thank the referee for identifying this omission. The reward for fine-grained classification is a binary verifiable function: reward = 1 if the final answer string exactly matches the ground-truth class label, and 0 otherwise. This is directly computable from the output without external judges. While the reward evaluates answer correctness, the Visual-RFT pipeline differs from SFT by sampling multiple reasoning+answer trajectories per image and optimizing via GRPO, which reinforces visual reasoning paths that lead to correct classifications. We will add the explicit mathematical definition and pseudocode for the classification reward (alongside the IoU formulation) in the revised Section 3. revision: yes
-
Referee: [§4] §4 (experiments): no training-budget table, no wall-clock or token counts for the GRPO runs versus the SFT baselines, and no statistical significance tests (e.g., standard error over multiple seeds) are reported for the headline numbers (+24.3% classification, +21.9 COCO 2-shot). Without these controls it is impossible to rule out that the observed differences arise from longer effective optimization or variance rather than the visual reward.
Authors: We agree that these controls are required for a rigorous comparison. In the revised manuscript we will insert a new table in Section 4 that reports training budgets (total tokens, wall-clock time, and optimization steps) for Visual-RFT versus SFT on every benchmark. We will also rerun the primary experiments across multiple random seeds and report means with standard errors to quantify statistical significance of the gains. revision: yes
-
Referee: [§3.2] §3.2 (GRPO formulation): the paper adopts the standard GRPO objective without modification. The manuscript must isolate whether any performance increment survives when the visual reward is replaced by a non-visual answer-match reward; otherwise the central claim that 'visual perception verifiable reward functions' are the operative ingredient remains untested.
Authors: We accept the need for this isolation experiment. In the revision we will add an ablation that replaces the task-specific visual rewards (IoU for detection/grounding, exact-match for classification) with a non-visual answer-match reward that only checks final-answer correctness. Performance differences will be reported to test whether the visual component of the reward is responsible for the observed gains. For spatial tasks the non-visual reward necessarily omits dense localization signals, but we will present the comparison explicitly. revision: yes
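The binary classification reward described in the first response can be sketched as below. The rebuttal states only that the reward is 1 when the final answer string exactly matches the ground-truth label and 0 otherwise. The `<answer>...</answer>` tag format used to locate the final answer, the whitespace stripping, and the function name are assumptions of this sketch.

```python
import re

# Hypothetical output format: the final answer is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def classification_reward(response, label):
    """Return 1.0 if the extracted final answer equals the label, else 0.0."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        # No parseable final answer: treat as incorrect.
        return 0.0
    # Surrounding whitespace is ignored (an assumption); case must match.
    return 1.0 if match.group(1).strip() == label.strip() else 0.0
```

Because this reward depends only on the final answer string, it is the kind of signal the referee's ablation request is meant to separate from spatially dense rewards such as IoU.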
Circularity Check
No significant circularity in Visual-RFT derivation chain
The paper's core method applies standard GRPO (from external prior work) to multiple LVLM-generated responses, using externally defined verifiable rewards such as IoU for detection and analogous task-specific functions for classification/grounding. These rewards are constructed from standard metrics independent of the model's fitted parameters or the target performance numbers. Reported gains (e.g., +24.3% on one-shot classification) are empirical benchmark results, not mathematical predictions or derivations that reduce to the inputs by construction. No self-definitional steps, fitted-input-as-prediction, or load-bearing self-citation chains appear in the described procedure. The derivation remains self-contained against external benchmarks and standard RL components.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Verifiable reward functions such as IoU can be computed reliably from model outputs and ground truth without additional learned components.
Forward citations
Cited by 28 Pith papers
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...
-
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
-
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
-
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
-
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
-
Visual-ERM: Reward Modeling for Visual Equivalence
Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.
-
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...
-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
-
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
-
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...
-
Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization
Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.
-
CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
-
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
-
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
-
Specificity-aware reinforcement learning for fine-grained open-world classification
SpeciaRL applies a dynamic verifier-based reward in reinforcement learning to steer reasoning LMMs toward correct and specific predictions on fine-grained open-world image classification tasks.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
-
Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...
-
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units
SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
-
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.