Abstract
Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors throughout the reasoning process. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing precise, execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
Key Contributions
🎯 Addresses Three Core Challenges
We tackle noisy rewards, low factual fidelity, and misalignment in process supervision through a unified framework combining tree search and tool verification.
🔬 Novel Framework Design
First work to combine MCTS-guided path construction with execution-based verification for automatic, high-fidelity process reward annotation.
📊 Superior Data Efficiency
Achieves state-of-the-art performance using only 10% of the training data required by the best-performing auto-labeled PRM, demonstrating exceptional sample efficiency.
🚀 Strong Empirical Results
Up to a 26% relative improvement in average F1 on ProcessBench, and outperforms even PRMs trained with human-labeled supervision when used for reward-guided greedy search.
Method Overview
Pipeline
GroundedPRM generates high-quality process supervision through a three-stage pipeline:
1 Tree-Guided Path Construction via MCTS
We build structured reasoning trees using Monte Carlo Tree Search, where each node represents an intermediate reasoning state. The search proceeds in four phases (a minimal sketch follows this list):
- Selection: Use an Upper Confidence Bound (UCB) policy to traverse promising branches
- Expansion: Generate multiple candidate next steps via LLM sampling
- Simulation: Roll out to completion and evaluate final outcomes
- Backpropagation: Update node values based on outcome success
💡 This enables fine-grained credit assignment across reasoning steps, reducing the noise inherent in vanilla Monte Carlo estimation.
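For concreteness, here is a minimal, self-contained sketch of the four phases. It is an illustration only: `generate_steps`, `rollout_to_answer`, and `is_correct` are hypothetical stand-ins for the LLM sampling and answer checking used in the paper, and the exploration constant and reward details are simplified assumptions.

```python
# Minimal MCTS sketch for reasoning-path construction (illustrative only).
import math
import random


class Node:
    def __init__(self, state, parent=None):
        self.state = state          # partial reasoning path (list of steps)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # accumulated outcome reward

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def search(root, n_iters, generate_steps, rollout_to_answer, is_correct):
    for _ in range(n_iters):
        # 1) Selection: follow the highest-UCB child until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)

        # 2) Expansion: sample candidate next steps from the LLM.
        for step in generate_steps(node.state):
            node.children.append(Node(node.state + [step], parent=node))
        if node.children:
            node = random.choice(node.children)

        # 3) Simulation: roll out to a final answer and score the outcome.
        answer = rollout_to_answer(node.state)
        reward = 1.0 if is_correct(answer) else 0.0

        # 4) Backpropagation: propagate the outcome reward to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root
```

Per-node visit counts and accumulated values then serve as MCTS-derived feedback for individual steps, rather than a single trajectory-level score, which is what enables the fine-grained credit assignment described above.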
2 Tool-Augmented Step Verification
Each intermediate reasoning step is validated using an external verifier (e.g., Wolfram Alpha for math problems), as sketched after the list below:
- Execution-Based Grounding: Convert each step into a math query and verify with an external tool
- Binary Correctness Signal: Check whether each step's intermediate result matches the tool-computed value
- Factual Fidelity: Eliminate hallucinated supervision with tool-verified signals
💡 Unlike LLM-based self-evaluation, tool verification provides ground-truth correctness signals at each step.
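A hedged sketch of such a verifier is shown below. The paper queries Wolfram Alpha; here SymPy is used as a stand-in so the example runs offline, and `extract_equation` is a hypothetical helper that turns a natural-language step into a checkable expression.

```python
# Illustrative step verifier using SymPy as a stand-in for the external tool
# (the paper uses Wolfram Alpha). `extract_equation` is a hypothetical helper
# that turns a natural-language step into a checkable "lhs = rhs" expression.
from sympy import simplify, sympify


def verify_step(step_text, extract_equation):
    """Return 1 if the step's claimed equality holds, 0 if not, None if unverifiable."""
    eq = extract_equation(step_text)         # e.g. "3*(x + 2) = 3*x + 6"
    if eq is None or "=" not in eq:
        return None                          # nothing checkable in this step
    lhs, rhs = eq.split("=", 1)
    try:
        # Binary correctness signal: does lhs - rhs simplify to zero?
        return int(simplify(sympify(lhs) - sympify(rhs)) == 0)
    except Exception:
        return None                          # tool could not parse the query


# Example: a correct algebraic step passes, an incorrect one fails.
print(verify_step("Expand: 3*(x + 2) = 3*x + 6", lambda s: s.split(": ", 1)[1]))  # 1
print(verify_step("Expand: 3*(x + 2) = 3*x + 5", lambda s: s.split(": ", 1)[1]))  # 0
```

Steps the tool cannot parse return `None`, so downstream aggregation can fall back on outcome feedback instead of trusting an unverified step.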
3 Hybrid Reward Aggregation & Generative Training
We combine tool-verified step feedback with overall outcome correctness and express rewards in natural language for generative supervision (illustrated after the list below):
- Hybrid Signal: Combine tool-verified steps with outcome feedback for stable and reliable rewards.
- Rationale Generation: Turn step judgments into concise natural language explanations.
- Generative Objective: Train PRM to produce reward labels and rationales via language modeling.
- Inference: Use predicted rewards to guide reasoning and choose better solutions.
💡 This generative formulation enhances interpretability and ensures compatibility with instruction-tuned LLMs.
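The following sketch illustrates one way such aggregation and formatting could look. The weighting `alpha`, the `None` fallback, and the rationale template are assumptions made for illustration; the paper's exact aggregation rule and prompt format may differ.

```python
# Illustrative hybrid reward aggregation and generative label formatting.
# `alpha` and the rationale template are assumptions for this sketch.

def hybrid_reward(tool_signal, mcts_value, alpha=0.5):
    """Fuse tool-based verification (0/1, or None if unverifiable)
    with MCTS-derived outcome feedback (value/visits in [0, 1])."""
    if tool_signal is None:
        return mcts_value                      # fall back to outcome feedback
    return alpha * tool_signal + (1 - alpha) * mcts_value


def to_generative_target(step_idx, step_text, reward, threshold=0.5):
    """Render the reward as a rationale-enhanced natural-language label,
    so an instruction-tuned LLM can be trained to generate it."""
    verdict = "correct" if reward >= threshold else "incorrect"
    return (
        f"Step {step_idx}: {step_text}\n"
        f"Judgment: The step is {verdict} "
        f"(tool check and rollout feedback give a fused score of {reward:.2f})."
    )


sample = to_generative_target(2, "So 3*(x + 2) = 3*x + 6.", hybrid_reward(1, 0.8))
print(sample)
```

At inference, the PRM's generated judgments provide the step scores used for reward-guided greedy search over candidate solutions.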
Main Results
ProcessBench Evaluation
Reward-Guided Greedy Search
Qualitative Analysis
GroundedPRM provides interpretable, step-by-step evaluations of reasoning processes. Below are example cases showing how our model identifies errors and assigns rewards at each reasoning step.
BibTeX
@misc{zhang2025groundedprmtreeguidedfidelityawareprocess,
title={GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning},
author={Yao Zhang and Yu Wu and Haowei Zhang and Weiguo Li and Haokun Chen and Jingpei Wu and Guohao Li and Zhen Han and Volker Tresp},
year={2025},
eprint={2510.14942},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.14942},
}