Visually Grounded RFT
A multimodal reasoning framework that combines supervised format finetuning with visually grounded reinforcement finetuning, aligning each reasoning step with point-level visual references.
We present Point-RFT, a novel multimodal reasoning framework that improves visual reasoning through visually grounded chains of thought in reinforcement finetuning. Our approach demonstrates that "visually grounded CoT" is more effective for multimodal reasoning than text-only thoughts.
Point-RFT employs a two-stage training framework: (1) Supervised Format Finetuning (SFT) with our Point-CoT dataset of 71K examples, where every reasoning step is aligned with point-level visual references, and (2) Visually Grounded Reinforcement Finetuning (RFT) using Group Relative Policy Optimization (GRPO) with dual rewards for format adherence and answer correctness.
Our method demonstrates superior generalization ability and interpretability on five out-of-domain benchmarks, showcasing the effectiveness of visually grounded reasoning in multimodal understanding tasks.
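To make the stage-two objective concrete, here is a minimal sketch of how the dual reward could be computed and turned into GRPO's group-relative advantages. The function names, the binary reward values, and the equal weighting of the two rewards are illustrative assumptions, not the paper's exact recipe.

```python
import re

# Sketch of the stage-two dual reward under GRPO: a format reward for the
# grounded output template plus an answer-correctness reward, normalized
# within each group of sampled responses. Names and weights are assumptions.

GROUNDED_FORMAT = re.compile(
    r"^<think>.*?<points[^>]*>.*?</points>.*?</think>\s*<answer>.*?</answer>$",
    re.DOTALL,
)

def format_reward(response: str) -> float:
    """1.0 if the response follows the grounded <think>/<answer> template."""
    return 1.0 if GROUNDED_FORMAT.match(response.strip()) else 0.0

def answer_reward(response: str, gold: str) -> float:
    """1.0 if the text inside <answer>...</answer> matches the reference."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's baseline-free advantage: normalize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Usage for one question with a group of sampled responses:
# rewards = [format_reward(r) + answer_reward(r, gold) for r in group]
# advantages = group_relative_advantages(rewards)
```

The group-wise normalization is what distinguishes GRPO from PPO-style training: advantages come from comparing sampled responses against each other rather than from a learned value function.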
A novel multimodal reasoning framework explicitly designed for visually grounded reinforcement fine-tuning that aligns reasoning steps with visual references.
Empirically demonstrates that "visually grounded CoT" is more effective for multimodal reasoning than text-only thoughts, improving both reasoning accuracy and interpretability.
A curated 71K-example dataset where every reasoning step is aligned with point-level visual references, enabling supervised format finetuning that teaches the model to "think while pointing" (see the illustrative record after this list).
Demonstrates strong generalization and interpretability on five out-of-domain benchmarks spanning diverse reasoning tasks.
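Since the released Point-CoT schema is not reproduced here, the following hypothetical record merely illustrates what a "think while pointing" training example could look like; all field names and the normalized-coordinate convention are assumptions.

```python
# Hypothetical Point-CoT-style record: each reasoning step cites the image
# locations it relies on as (x, y) points. All field names and the
# normalized-coordinate convention are illustrative assumptions.
example = {
    "image": "chart_0042.png",
    "question": "Which category has the highest value?",
    "steps": [
        {"text": "The tallest bar is the third from the left.",
         "points": [[0.52, 0.31]]},  # assumed normalized (x, y)
        {"text": "Its x-axis label reads 'Transport'.",
         "points": [[0.52, 0.88]]},
    ],
    "answer": "Transport",
}
```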
Point-RFT structures each response as visually grounded reasoning followed by a final answer, in the following format:
<think> <points...>...</points> ... </think> <answer>...</answer>
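As an illustration of how this template can be consumed, here is a small parser that splits a response into its reasoning, point references, and final answer. The handling of attributes on the <points> tag is an assumption inferred from the template above.

```python
import re

def parse_response(response: str) -> dict:
    """Split a Point-RFT-style response into reasoning, points, and answer.

    Attribute handling on <points> is an assumption inferred from the
    template above; adjust the patterns to the actual tag syntax.
    """
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    points = (re.findall(r"<points[^>]*>(.*?)</points>", think.group(1), re.DOTALL)
              if think else [])
    return {
        "reasoning": think.group(1).strip() if think else None,
        "points": [p.strip() for p in points],
        "answer": answer.group(1).strip() if answer else None,
    }
```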
Point-RFT performs strongly across diverse domains, demonstrating that visually grounded thoughts improve both reasoning accuracy and interpretability.
If you find Point-RFT useful in your research, please consider citing our paper:
@article{ni2025pointrft,
  title={Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning},
  author={Ni, Minheng and Yang, Zhengyuan and Li, Linjie and Lin, Chung-Ching and Lin, Kevin and Zuo, Wangmeng and Wang, Lijuan},
  journal={arXiv preprint arXiv:2505.19702},
  year={2025}
}