GroundedPRM: Tree-Guided and Fidelity-Aware
Process Reward Modeling for Step-Level Reasoning

Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Jingpei Wu, Guohao Li, Zhen Han, Volker Tresp

1 LMU Munich 2 Technical University of Munich 3 Fudan University 4 University of Heidelberg
5 University of Oxford 6 AWS AI 7 Munich Center for Machine Learning (MCML)

Abstract

Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors throughout the reasoning process. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing precise, execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.

Key Contributions

🎯 Addresses Three Core Challenges

We tackle noisy rewards, low factual fidelity, and misalignment in process supervision through a unified framework combining tree search and tool verification.

🔬 Novel Framework Design

First work to combine MCTS-guided path construction with execution-based verification for automatic, high-fidelity process reward annotation.

📊 Superior Data Efficiency

Achieves state-of-the-art performance while using only 10% of the training data of the best-performing auto-labeled PRM, demonstrating exceptional sample efficiency.

🚀 Strong Empirical Results

+26% relative improvement on ProcessBench average F1, outperforming even human-labeled PRMs in reward-guided search applications.

Method Overview

GroundedPRM Framework Overview
Figure 1: Overview of the GroundedPRM Framework. GroundedPRM constructs reasoning paths via MCTS, where each node corresponds to an LLM-generated step. During simulation, intermediate steps are verified using an external tool, and final answers are checked against ground truth. Step-level and outcome-level correctness signals are aggregated into a rollout reward, which is backpropagated along the tree to update node statistics; the next node is then selected by UCT, continuing the MCTS search until convergence or budget exhaustion. The framework enables verifiable, interpretable, and structure-aware process supervision for multi-step reasoning. The generative rationale provides interpretable feedback for each step.

Pipeline

GroundedPRM generates high-quality process supervision through a three-stage pipeline:

1 Tree-Guided Path Construction via MCTS

We build structured reasoning trees using Monte Carlo Tree Search, where each node represents an intermediate reasoning state. The search iterates over four phases (a minimal sketch follows this list):

  • Selection: Use the UCT rule (UCB applied to trees) to traverse promising branches
  • Expansion: Generate multiple candidate next steps via LLM sampling
  • Simulation: Roll out to completion and evaluate final outcomes
  • Backpropagation: Update node values based on outcome success

💡 This enables fine-grained credit assignment across reasoning steps, reducing the noise inherent in vanilla Monte Carlo estimation.
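
To make the loop concrete, here is a minimal, generic sketch of the four phases. It illustrates the search pattern rather than the released implementation: propose_next_steps (LLM step sampling) and rollout_and_score (simulation plus verification, as in the next stage) are hypothetical placeholders, and the exploration constant is an arbitrary choice.

```python
import math
import random

class Node:
    def __init__(self, step, parent=None):
        self.step = step          # text of the reasoning step at this node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # accumulated rollout reward

    def uct(self, c=1.41):
        # UCT score: mean reward (exploitation) plus a visit-count bonus (exploration).
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root, propose_next_steps, rollout_and_score, budget=100):
    for _ in range(budget):
        # Selection: follow the highest-UCT child until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct())
        # Expansion: add candidate next steps sampled from the LLM.
        for step in propose_next_steps(node):
            node.children.append(Node(step, parent=node))
        leaf = random.choice(node.children) if node.children else node
        # Simulation: roll out to a final answer and score the rollout
        # (tool checks on intermediate steps plus a ground-truth check on the answer).
        reward = rollout_and_score(leaf)
        # Backpropagation: update visit counts and values along the path to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return root
```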

2 Tool-Augmented Step Verification

Each intermediate reasoning step is validated using an external verifier (e.g., Wolfram Alpha for math problems); a minimal sketch of the idea follows this list:

  • Execution-Based Grounding: Convert each step into a math query and verify with an external tool
  • Binary Correctness Signal: Check if intermediate results match expected values
  • Factual Fidelity: Eliminate hallucinated supervision with tool-verified signals

💡 Unlike LLM-based self-evaluation, tool verification provides ground-truth correctness signals at each step.
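
The verification idea can be sketched with a small checker. This is an assumption-laden stand-in: it evaluates arithmetic claims locally with SymPy instead of calling an external service such as Wolfram Alpha, and the claim-extraction pattern is a simplification of how steps would be turned into math queries.

```python
import re
from sympy import sympify

def verify_step(step_text, tol=1e-9):
    """Return (is_correct, rationale) for arithmetic claims of the form
    '<expression> = <value>' found in a reasoning step."""
    claims = re.findall(r"([\d\s\+\-\*/\.\(\)]+)=\s*([\-\d\.]+)", step_text)
    for expr, claimed in claims:
        actual = float(sympify(expr))          # execution-grounded evaluation
        if abs(actual - float(claimed)) > tol:
            return False, f"{expr.strip()} evaluates to {actual:g}, not {claimed}"
    return True, "all checked computations are consistent"

# The erroneous sum from the case study in Figure 3:
print(verify_step("50+80+80+60+40+90+100+70+60 = 570"))
# -> (False, '50+80+80+60+40+90+100+70+60 evaluates to 630, not 570')
```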

3 Hybrid Reward Aggregation & Generative Training

We combine tool-verified step feedback with overall outcome correctness and express rewards in natural language for generative supervision (sketched after this list):

  • Hybrid Signal: Combine tool-verified steps with outcome feedback for stable and reliable rewards.
  • Rationale Generation: Turn step judgments into concise natural language explanations.
  • Generative Objective: Train PRM to produce reward labels and rationales via language modeling.
  • Inference: Use predicted rewards to guide reasoning and choose better solutions.

💡 This generative formulation enhances interpretability and ensures compatibility with instruction-tuned LLMs.
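
A minimal sketch of the aggregation and the generative label format is shown below. The weighting scheme (alpha) and the text template are assumptions made for illustration; the paper's exact aggregation rule and prompt format may differ.

```python
def aggregate_reward(step_verdicts, outcome_correct, alpha=0.5):
    """Fuse tool-verified step correctness with final-answer correctness."""
    step_score = sum(step_verdicts) / len(step_verdicts) if step_verdicts else 0.0
    outcome_score = 1.0 if outcome_correct else 0.0
    return alpha * step_score + (1.0 - alpha) * outcome_score

def to_generative_sample(question, steps, verdicts, rationales):
    """Format step judgments as a rationale-enhanced training example."""
    lines = [f"Question: {question}"]
    for i, (step, ok, why) in enumerate(zip(steps, verdicts, rationales), 1):
        label = "correct" if ok else "incorrect"
        lines.append(f"Step {i}: {step}\nJudgment: {label}. Rationale: {why}")
    return "\n".join(lines)

# Example: two verified steps, one failed step, wrong final answer.
# aggregate_reward([1, 1, 0], outcome_correct=False) -> 0.333...
```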

Main Results

ProcessBench Evaluation

ProcessBench Evaluation Results
Table 1: Main Results on ProcessBench. We compare GroundedPRM with various baseline methods across multiple mathematical reasoning datasets (GSM8K, MATH, Olympiad, Omni-MATH). Our method achieves +26.0% relative improvement in average F1 score (39.7 vs 31.5) compared to the best auto-labeled baseline (Math-Shepherd-PRM-7B), despite using only 10% of the training data. Notably, GroundedPRM outperforms even methods trained with costly human-labeled supervision, demonstrating the effectiveness of our tree-guided, tool-verified approach for automatic process reward annotation.
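
As a quick sanity check on the quoted number, the relative improvement follows directly from the two average F1 scores in the table:

```python
# Relative improvement of GroundedPRM over the best auto-labeled baseline (Table 1).
best_auto, ours = 31.5, 39.7
print(f"{(ours - best_auto) / best_auto:.1%}")   # -> 26.0%
```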

Reward-Guided Greedy Search

Reward-Guided Search Performance
Figure 2: Reward-Guided Greedy Search Performance. Accuracy of reward-guided greedy search using different PRMs to supervise the Qwen2.5-7B-Instruct policy model. GroundedPRM outperforms all PRMs trained with human, mixed, or automated labels, achieving the highest average accuracy (42.4%) across six datasets (AMC23, AIME24, MATH, College, Olympiad, and Minerva). Notably, despite being trained on only 40K automatically labeled samples, it surpasses PRMs trained with costly human annotations, demonstrating the effectiveness of its tree-guided, tool-verified, and rationale-enhanced supervision framework.
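
For reference, reward-guided greedy search can be summarized with a short sketch. policy_sample_steps and prm_score are hypothetical interfaces to the policy model and the trained PRM, and the candidate count, step limit, and stopping rule are illustrative assumptions rather than the paper's settings.

```python
def greedy_search(question, policy_sample_steps, prm_score,
                  num_candidates=8, max_steps=16):
    partial = []                              # reasoning steps chosen so far
    for _ in range(max_steps):
        candidates = policy_sample_steps(question, partial, num_candidates)
        if not candidates:
            break
        # Greedy choice: keep the candidate the PRM scores highest.
        best = max(candidates, key=lambda s: prm_score(question, partial + [s]))
        partial.append(best)
        if best.strip().startswith("Final answer"):   # assumed stop criterion
            break
    return partial
```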

Qualitative Analysis

GroundedPRM provides interpretable, step-by-step evaluations of reasoning processes. Below are example cases showing how our model identifies errors and assigns rewards at each reasoning step.

Qualitative Analysis Examples
Figure 3: Error Detection via Tool-Augmented Verification. We present a concrete case study demonstrating how GroundedPRM identifies reasoning errors that are missed by LLM-based judges. In this example, the model incorrectly calculates the sum of Sally's quiz scores (claims 50+80+80+60+40+90+100+70+60 = 570, but the correct sum is 630). While GPT-4o as a judge fails to detect this arithmetic error and incorrectly validates the step as correct, GroundedPRM's tool-based verification successfully catches the mistake through actual execution. The model generates an interpretable rationale explaining why the step is incorrect and assigns a negative reward. This demonstrates a key advantage of our approach: external tool verification provides more reliable, execution-grounded supervision than LLM self-evaluation, which is prone to hallucination and cannot verify computational correctness.

BibTeX

@misc{zhang2025groundedprmtreeguidedfidelityawareprocess,
      title={GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning},
      author={Yao Zhang and Yu Wu and Haowei Zhang and Weiguo Li and Haokun Chen and Jingpei Wu and Guohao Li and Zhen Han and Volker Tresp},
      year={2025},
      eprint={2510.14942},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.14942},
}