Abstract
Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors throughout the reasoning process. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing precise, execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
Key Contributions
🎯 Addresses Three Core Challenges
We tackle noisy rewards, low factual fidelity, and misalignment in process supervision through a unified framework combining tree search and tool verification.
🔬 Novel Framework Design
First work to combine MCTS-guided path construction with execution-based verification for automatic, high-fidelity process reward annotation.
📊 Superior Data Efficiency
Achieves state-of-the-art performance using only 10% of the training data required by the best-performing auto-labeled PRM, demonstrating exceptional sample efficiency.
🚀 Strong Empirical Results
Up to a 26% relative improvement in average F1 on ProcessBench, and outperforms even PRMs trained with human-labeled supervision when used for reward-guided greedy search.
Method Overview
Pipeline
GroundedPRM generates high-quality process supervision through a three-stage pipeline:
1 Tree-Guided Path Construction via MCTS
We build structured reasoning trees using Monte Carlo Tree Search, where each node represents an intermediate reasoning state. The search proceeds in four phases (a minimal sketch follows this list):
- Selection: Use an Upper Confidence Bound (UCB) policy to traverse promising branches
- Expansion: Generate multiple candidate next steps via LLM sampling
- Simulation: Roll out to completion and evaluate final outcomes
- Backpropagation: Update node values based on outcome success
💡 This enables fine-grained credit assignment across reasoning steps, reducing the noise inherent in vanilla Monte Carlo estimation.
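For concreteness, here is a minimal, self-contained sketch of the four phases. It is an illustration only: `generate_steps`, `rollout_to_answer`, and `is_correct` are hypothetical stand-ins for the LLM sampling and answer checking used in the paper, and the exploration constant and reward details are simplified assumptions.

```python
# Minimal MCTS sketch for reasoning-path construction (illustrative only).
import math
import random


class Node:
    def __init__(self, state, parent=None):
        self.state = state          # partial reasoning path (list of steps)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # accumulated outcome reward

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def search(root, n_iters, generate_steps, rollout_to_answer, is_correct):
    for _ in range(n_iters):
        # 1) Selection: follow the highest-UCB child until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)

        # 2) Expansion: sample candidate next steps from the LLM.
        for step in generate_steps(node.state):
            node.children.append(Node(node.state + [step], parent=node))
        if node.children:
            node = random.choice(node.children)

        # 3) Simulation: roll out to a final answer and score the outcome.
        answer = rollout_to_answer(node.state)
        reward = 1.0 if is_correct(answer) else 0.0

        # 4) Backpropagation: propagate the outcome reward to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root
```

Per-node visit counts and accumulated values then serve as MCTS-derived feedback for individual steps, rather than a single trajectory-level score, which is what enables the fine-grained credit assignment described above.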
2 Tool-Augmented Step Verification
Each intermediate reasoning step is validated using an external verifier (e.g., Wolfram Alpha for math problems), as sketched after the list below:
- Execution-Based Grounding: Convert each step into a math query and verify with an external tool
- Binary Correctness Signal: Check whether each step's intermediate result matches the tool-computed value
- Factual Fidelity: Eliminate hallucinated supervision with tool-verified signals
💡 Unlike LLM-based self-evaluation, tool verification provides ground-truth correctness signals at each step.
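A hedged sketch of such a verifier is shown below. The paper queries Wolfram Alpha; here SymPy is used as a stand-in so the example runs offline, and `extract_equation` is a hypothetical helper that turns a natural-language step into a checkable expression.

```python
# Illustrative step verifier using SymPy as a stand-in for the external tool
# (the paper uses Wolfram Alpha). `extract_equation` is a hypothetical helper
# that turns a natural-language step into a checkable "lhs = rhs" expression.
from sympy import simplify, sympify


def verify_step(step_text, extract_equation):
    """Return 1 if the step's claimed equality holds, 0 if not, None if unverifiable."""
    eq = extract_equation(step_text)         # e.g. "3*(x + 2) = 3*x + 6"
    if eq is None or "=" not in eq:
        return None                          # nothing checkable in this step
    lhs, rhs = eq.split("=", 1)
    try:
        # Binary correctness signal: does lhs - rhs simplify to zero?
        return int(simplify(sympify(lhs) - sympify(rhs)) == 0)
    except Exception:
        return None                          # tool could not parse the query


# Example: a correct algebraic step passes, an incorrect one fails.
print(verify_step("Expand: 3*(x + 2) = 3*x + 6", lambda s: s.split(": ", 1)[1]))  # 1
print(verify_step("Expand: 3*(x + 2) = 3*x + 5", lambda s: s.split(": ", 1)[1]))  # 0
```

Steps the tool cannot parse return `None`, so downstream aggregation can fall back on outcome feedback instead of trusting an unverified step.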
3 Hybrid Reward Aggregation & Generative Training
We combine tool-verified step feedback with overall outcome correctness and express rewards in natural language for generative supervision (illustrated after the list below):
- Hybrid Signal: Combine tool-verified steps with outcome feedback for stable and reliable rewards.
- Rationale Generation: Turn step judgments into concise natural language explanations.
- Generative Objective: Train PRM to produce reward labels and rationales via language modeling.
- Inference: Use predicted rewards to guide reasoning and choose better solutions.
💡 This generative formulation enhances interpretability and ensures compatibility with instruction-tuned LLMs.
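The following sketch illustrates one way such aggregation and formatting could look. The weighting `alpha`, the `None` fallback, and the rationale template are assumptions made for illustration; the paper's exact aggregation rule and prompt format may differ.

```python
# Illustrative hybrid reward aggregation and generative label formatting.
# `alpha` and the rationale template are assumptions for this sketch.

def hybrid_reward(tool_signal, mcts_value, alpha=0.5):
    """Fuse tool-based verification (0/1, or None if unverifiable)
    with MCTS-derived outcome feedback (value/visits in [0, 1])."""
    if tool_signal is None:
        return mcts_value                      # fall back to outcome feedback
    return alpha * tool_signal + (1 - alpha) * mcts_value


def to_generative_target(step_idx, step_text, reward, threshold=0.5):
    """Render the reward as a rationale-enhanced natural-language label,
    so an instruction-tuned LLM can be trained to generate it."""
    verdict = "correct" if reward >= threshold else "incorrect"
    return (
        f"Step {step_idx}: {step_text}\n"
        f"Judgment: The step is {verdict} "
        f"(tool check and rollout feedback give a fused score of {reward:.2f})."
    )


sample = to_generative_target(2, "So 3*(x + 2) = 3*x + 6.", hybrid_reward(1, 0.8))
print(sample)
```

At inference, the PRM's generated judgments provide the step scores used for reward-guided greedy search over candidate solutions.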
Main Results
ProcessBench Evaluation
Reward-Guided Greedy Search
Qualitative Analysis
GroundedPRM provides interpretable, step-by-step evaluations of reasoning processes. Below are example cases showing how our model identifies errors and assigns rewards at each reasoning step.
BibTeX
@misc{zhang2025groundedprmtreeguidedfidelityawareprocess,
title={GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning},
author={Yao Zhang and Yu Wu and Haowei Zhang and Weiguo Li and Haokun Chen and Jingpei Wu and Guohao Li and Zhen Han and Volker Tresp},
year={2025},
eprint={2510.14942},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.14942},
}