Visually Grounded RFT
A multimodal reasoning framework that combines supervised format finetuning with visually grounded reinforcement finetuning, aligning each reasoning step with point-level visual references.
We present Point-RFT, a novel multimodal reasoning framework that improves visual reasoning through visually grounded chains of thought in reinforcement finetuning. Our approach demonstrates that "visually grounded CoT" is more effective for multimodal reasoning than text-only thoughts.
Point-RFT employs a two-stage training framework: (1) Supervised Format Finetuning (SFT) with our Point-CoT dataset of 71K examples, where every reasoning step is aligned with point-level visual references, and (2) Visually Grounded Reinforcement Finetuning (RFT) using Group Relative Policy Optimization (GRPO) with dual rewards for format adherence and answer correctness.
Our method demonstrates superior generalization ability and interpretability on five out-of-domain benchmarks, showcasing the effectiveness of visually grounded reasoning in multimodal understanding tasks.
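To make the stage-two objective concrete, here is a minimal sketch of how the dual reward could be computed and turned into GRPO's group-relative advantages. The function names, the binary reward values, and the equal weighting of the two rewards are illustrative assumptions, not the paper's exact recipe.

```python
import re

# Sketch of the stage-two dual reward under GRPO: a format reward for the
# grounded output template plus an answer-correctness reward, normalized
# within each group of sampled responses. Names and weights are assumptions.

GROUNDED_FORMAT = re.compile(
    r"^<think>.*?<points[^>]*>.*?</points>.*?</think>\s*<answer>.*?</answer>$",
    re.DOTALL,
)

def format_reward(response: str) -> float:
    """1.0 if the response follows the grounded <think>/<answer> template."""
    return 1.0 if GROUNDED_FORMAT.match(response.strip()) else 0.0

def answer_reward(response: str, gold: str) -> float:
    """1.0 if the text inside <answer>...</answer> matches the reference."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's baseline-free advantage: normalize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Usage for one question with a group of sampled responses:
# rewards = [format_reward(r) + answer_reward(r, gold) for r in group]
# advantages = group_relative_advantages(rewards)
```

The group-wise normalization is what distinguishes GRPO from PPO-style training: advantages come from comparing sampled responses against each other rather than from a learned value function.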
A novel multimodal reasoning framework explicitly designed for visually grounded reinforcement fine-tuning that aligns reasoning steps with visual references.
Empirically demonstrates that "visually grounded CoT" is more effective for multimodal reasoning than text-only thoughts, improving both reasoning accuracy and interpretability.
A curated 71K-example dataset where every reasoning step is aligned with point-level visual references, enabling supervised format finetuning that teaches the model to "think while pointing" (see the illustrative record after this list).
Demonstrates strong generalization and interpretability on five out-of-domain benchmarks spanning diverse reasoning tasks.
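Since the released Point-CoT schema is not reproduced here, the following hypothetical record merely illustrates what a "think while pointing" training example could look like; all field names and the normalized-coordinate convention are assumptions.

```python
# Hypothetical Point-CoT-style record: each reasoning step cites the image
# locations it relies on as (x, y) points. All field names and the
# normalized-coordinate convention are illustrative assumptions.
example = {
    "image": "chart_0042.png",
    "question": "Which category has the highest value?",
    "steps": [
        {"text": "The tallest bar is the third from the left.",
         "points": [[0.52, 0.31]]},  # assumed normalized (x, y)
        {"text": "Its x-axis label reads 'Transport'.",
         "points": [[0.52, 0.88]]},
    ],
    "answer": "Transport",
}
```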
Point-RFT structures each response as visually grounded reasoning followed by a final answer, in the following format:
<think> <points...>...</points> ... </think> <answer>...</answer>
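As an illustration of how this template can be consumed, here is a small parser that splits a response into its reasoning, point references, and final answer. The handling of attributes on the <points> tag is an assumption inferred from the template above.

```python
import re

def parse_response(response: str) -> dict:
    """Split a Point-RFT-style response into reasoning, points, and answer.

    Attribute handling on <points> is an assumption inferred from the
    template above; adjust the patterns to the actual tag syntax.
    """
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    points = (re.findall(r"<points[^>]*>(.*?)</points>", think.group(1), re.DOTALL)
              if think else [])
    return {
        "reasoning": think.group(1).strip() if think else None,
        "points": [p.strip() for p in points],
        "answer": answer.group(1).strip() if answer else None,
    }
```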
Point-RFT performs strongly across diverse domains, demonstrating that visually grounded thoughts improve both reasoning accuracy and interpretability.
If you find Point-RFT useful in your research, please consider citing our paper:
@article{ni2025pointrft,
  title={Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning},
  author={Ni, Minheng and Yang, Zhengyuan and Li, Linjie and Lin, Chung-Ching and Lin, Kevin and Zuo, Wangmeng and Wang, Lijuan},
  journal={arXiv preprint arXiv:2505.19702},
  year={2025}
}