Point-RFT

Improving Multimodal Visual Reasoning with Visually Grounded Thoughts in Reinforcement Learning

Microsoft
Point-RFT Framework Overview

Point-RFT employs a two-stage training framework that combines supervised format fine-tuning with visually grounded reinforcement learning to improve multimodal reasoning capabilities.

Abstract

We present Point-RFT, a novel multimodal reasoning framework that improves visual reasoning through visually grounded thoughts in reinforcement learning. Our approach demonstrates that visually grounded chain-of-thought (CoT) is more effective for multimodal reasoning than text-only thoughts.

Point-RFT employs a two-stage training framework: (1) Supervised Format Finetuning (SFT) with our Point-CoT dataset of 71K examples, where every reasoning step is aligned with point-level visual references, and (2) Visually Grounded Reinforcement Finetuning (RFT) using Group Relative Policy Optimization (GRPO) with dual rewards for format adherence and answer correctness.

Our method demonstrates superior generalization and interpretability on five out-of-domain benchmarks, underscoring the effectiveness of visually grounded reasoning in multimodal understanding tasks.

🌟 Key Features

Visually Grounded RFT

A novel multimodal reasoning framework explicitly designed for visually grounded reinforcement fine-tuning that aligns reasoning steps with visual references.

Enhanced Multimodal CoT

Empirically demonstrates that visually grounded CoT is more effective for multimodal reasoning than text-only thoughts, improving both reasoning accuracy and interpretability.

Point-CoT Dataset

A curated 71K-example dataset where every reasoning step is aligned with point-level visual references, enabling supervised format finetuning that teaches the model to "think while pointing".

Superior Generalization

Demonstrates strong generalization ability and interpretability on five out-of-domain benchmarks, showing robust performance across diverse reasoning tasks.

⚙️ How it Works

Point-RFT employs a two-stage training framework:

1. Supervised Format Finetuning (SFT) with Point-CoT Dataset

  • Construct a Point-CoT dataset of 71K examples integrating step-by-step text rationales with explicit grounding to visual points
  • Dataset generation involves GPT-4o for CoT generation and Molmo-7B for point grounding, with cross-validation to ensure consistency
  • Fine-tune a base model (Qwen2.5-VL) to generate step-by-step reasoning traces explicitly linked to visual pointing
  • The model learns to emit a fixed output format: `<think> <points...>...</points> ... </think> <answer>...</answer>` (an illustrative example follows this list)
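
For illustration, a grounded response in this template might look like the following. The question, step text, and point coordinates are invented for this example, and the exact attribute names inside `<points>` follow the paper's template rather than this sketch:

```
<think>
  <points x1="128" y1="310" alt="2021 bar">Step 1: The bar for 2021 reaches 34.</points>
  <points x1="236" y1="255" alt="2022 bar">Step 2: The bar for 2022 reaches 41.</points>
  Step 3: 41 - 34 = 7, so the value grew by 7.
</think>
<answer>7</answer>
```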

2. Visually Grounded Reinforcement Finetuning (RFT) with GRPO

  • Further optimize the SFT model with reinforcement learning using Group Relative Policy Optimization (GRPO)
  • Implement dual rewards:
    • Format Reward (R_f): Measures structural adherence to the template
    • Accuracy Reward (R_a): Computes answer correctness against ground truth
  • Optimize answer correctness and grounded-rationale coherence by rewarding localized visual-textual reasoning paths (a minimal reward sketch follows this list)
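
A minimal sketch of the dual reward in Python. The regex template, exact-match answer parsing, and equal weighting are assumptions for illustration; the paper's precise reward definitions may differ:

```python
import re
import statistics

# Template check: a <think> block containing at least one <points> span,
# followed by a non-empty <answer> block.
FORMAT_RE = re.compile(
    r"^<think>.*<points[^>]*>.*</points>.*</think>\s*<answer>.+</answer>$",
    re.DOTALL,
)

def format_reward(response: str) -> float:
    """R_f: structural adherence to the grounded-CoT template."""
    return 1.0 if FORMAT_RE.match(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """R_a: exact-match correctness of the extracted final answer."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    pred = m.group(1).strip() if m else ""
    return 1.0 if pred == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Equal weighting of R_f and R_a is an assumption of this sketch.
    return format_reward(response) + accuracy_reward(response, ground_truth)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize rewards within one sampled group,
    as GRPO does across multiple responses to the same prompt."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]
```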

📊 Point-CoT Dataset

Point-CoT Dataset Overview
  • The Point-CoT dataset comprises 71K examples covering diverse question types (e.g., counting, comparison, arithmetic), built from a subset of MAmmoTH-VL
  • Each instance integrates the reasoning process with point grounding, creating a novel form of multimodal CoT
  • The pipeline ensures spatial-textual consistency through cross-validation between GPT-4o (reasoning) and Molmo-7B (grounding); a rough sketch of this pipeline follows
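
As a rough sketch of that pipeline, assuming one instance per image: the callables below stand in for GPT-4o rationale generation, Molmo-7B point grounding, and the cross-validation check; none of these names are real APIs from the paper.

```python
from typing import Callable, Optional

def build_point_cot_example(
    image: str,
    question: str,
    answer: str,
    generate_cot: Callable[[str, str, str], list[str]],  # stand-in for GPT-4o
    ground_points: Callable[[str, str], str],            # stand-in for Molmo-7B
    is_consistent: Callable[[str, str], bool],           # cross-validation check
) -> Optional[dict]:
    """Build one Point-CoT instance, or drop it if grounding disagrees."""
    steps = generate_cot(image, question, answer)         # step-by-step rationale
    grounded = [ground_points(image, s) for s in steps]   # attach point references
    if not all(is_consistent(s, g) for s, g in zip(steps, grounded)):
        return None  # discard spatially/textually inconsistent samples
    return {"image": image, "question": question,
            "steps": grounded, "answer": answer}
```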

🏆 Case Studies

Chart Domain

Chart Domain Case Study

General Domain

General Domain Case Study

Point-RFT delivers strong performance across domains, with visually grounded thoughts improving both accuracy and interpretability.

📄 Citation

If you find Point-RFT useful in your research, please consider citing our paper:

@article{ni2025pointrft,
  title={Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning},
  author={Minheng Ni and Zhengyuan Yang and Linjie Li and Chung-Ching Lin and Kevin Lin and Wangmeng Zuo and Lijuan Wang},
  journal={arXiv preprint arXiv:2505.19702},
  year={2025}
}