Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves final-answer accuracy on reasoning tasks, but it does not reliably improve reasoning quality. Because outcome rewards only assess final answers, they also reward spurious successes: flawed reasoning can still receive maximal reward when it accidentally reaches the correct outcome. This outcome reward hacking creates biased gradients, making current RLVR insufficient for learning faithful reasoning. Process Reward Models (PRMs) provide step-wise supervision, but directly optimizing PRMs or naively combining them with outcome rewards is unstable under the distribution shift that arises during RL training. We introduce PRocess cOnsistency Filter (PROF), a data curation method that uses PRM–ORM consistency for sample selection rather than direct reward optimization. PROF keeps correct responses with strong process support and incorrect responses with weak process support while maintaining a balanced training ratio. Experiments show that PROF consistently improves both final-answer accuracy and intermediate reasoning quality over strong baselines, with less dependence on strong PRMs. Code and training recipes are available at https://github.com/amazon-science/PROF-GRPO.
Chenlu Ye1,2 (Email: chenluy3@illinois.edu), Zhou Yu1, Ziji Zhang1, Hao Chen1, Narayanan Sadagopan1, Jing Huang1, Tong Zhang2, Anurag Beniwal1
1Amazon 2University of Illinois Urbana-Champaign
1 Introduction
Verifiable rewards have attracted substantial attention because they can reliably improve performance on reasoning tasks with easily verifiable outcomes, such as mathematical and coding problems (Cobbe et al., 2021; Jaech et al., 2024; Shao et al., 2024; Xiong et al., 2025b). However, success on these tasks is usually measured only by the final answer, while in many applications we also care about the quality of the reasoning process itself, especially its faithfulness, validity, and interpretability. Throughout this paper, we use reasoning quality as an umbrella term for these process-level properties. Optimizing the verifier is therefore not the same as optimizing reasoning quality. Because verifiers only assess final outcomes, Outcome Reward Models (ORMs) are too sparse and coarse to distinguish flawed reasoning within correct answers or valid reasoning within incorrect answers. For instance, the training example in Table 1 has fundamentally invalid reasoning but still arrives at the correct answer. To analyze this challenge theoretically, we define a latent state variable $z \in \{0, 1\}$, where $z = 1$ denotes a valid intermediate reasoning process (i.e., no error). Let $p = P(z = 1)$ represent the probability of generating valid reasoning. Given that an incorrect process ($z = 0$) may coincidentally yield a correct answer ($r = 1$) with a small probability (i.e., $P(r = 1 \mid z = 0) = \epsilon > 0$), the expected reward can be decomposed as:
$$\mathbb{E}[r] = P(r = 1 \mid z = 1)\, p + \epsilon\, (1 - p).$$
While the ideal objective is to maximize $p$, the term $\epsilon (1 - p)$ introduces gains from spurious successes. During training, samples where $r = 1$ despite $z = 0$ generate biased gradients that inadvertently reinforce flawed reasoning paths, allowing the policy to increase outcome reward without improving the latent reasoning quality. This creates a process-outcome mismatch: final-answer correctness no longer reliably reflects reasoning quality, especially the faithfulness of the underlying reasoning process. We refer to the resulting optimization failure as outcome reward hacking: the model is rewarded for exploiting weaknesses in outcome-only supervision rather than for producing faithful reasoning. This misalignment leads to unfaithful reasoning, a limitation increasingly observed in recent studies (Baker et al., 2025; Chen et al., 2025b). Consequently, relying solely on final-answer accuracy is insufficient; ensuring reasoning quality, faithfulness, and interpretability in Chain of Thought (CoT) is crucial for the safety and practical utility of LLMs (Zhu et al., 2025; Lyu et al., 2023; Yeo et al., 2024). To empirically support this process-outcome mismatch and the resulting reasoning-quality gap, we later analyze 2k samples from Qwen2.5-Math-7B and find that of correct responses still contain flawed reasoning, as judged by Claude. Within this flawed-correct subset, PROF identifies and filters (Figure 1).
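The decomposition of the expected outcome reward into a faithful part and a spurious part can be illustrated with a small numerical sketch. It only applies the law of total probability; the function and argument names (`p_valid`, `acc_valid`, `eps_spurious`) are illustrative assumptions and do not come from the paper.

```python
def expected_reward(p_valid: float, acc_valid: float, eps_spurious: float):
    """Split E[r] into a faithful term and a spurious term.

    p_valid      : probability that the reasoning process is valid (z = 1)
    acc_valid    : P(r = 1 | z = 1)
    eps_spurious : P(r = 1 | z = 0), the chance of a lucky correct answer
    """
    faithful = acc_valid * p_valid
    spurious = eps_spurious * (1.0 - p_valid)
    return faithful, spurious, faithful + spurious


def spurious_share(p_valid: float, acc_valid: float, eps_spurious: float) -> float:
    """Fraction of rewarded (r = 1) samples whose reasoning is flawed."""
    _, spurious, total = expected_reward(p_valid, acc_valid, eps_spurious)
    return spurious / total if total > 0 else 0.0
```

With the hypothetical values `p_valid = 0.5`, `acc_valid = 1.0`, and `eps_spurious = 0.2`, one in six rewarded samples would stem from flawed reasoning, and each of them contributes a positive learning signal under outcome-only rewards.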
This process-outcome mismatch shows that current RLVR alone cannot close the reasoning-quality gap: outcome rewards are necessary for verifiability but insufficient for supervising how the answer is reached. This has motivated a flurry of recent work on training Process Reward Models (PRMs) and using them in RL training (Lightman et al., 2023; Zhang et al., 2025; Zou et al., 2025), since PRMs provide dense and fine-grained feedback over intermediate reasoning processes. In other words, if we want to optimize reasoning quality rather than final correctness alone, some form of PRM-style process supervision is necessary. However, directly using PRMs as rewards introduces a second failure mode. Although these PRMs achieve excellent performance on PRM benchmarks, directly combining PRM and ORM scores in the reward function can lead to reward hacking. In particular, because PRMs are usually trained offline, applying them during online training exposes them to distribution shift. In boundary cases where the policy encounters difficult problems and produces rarely seen responses, PRMs often misjudge them, which leads to severe reward hacking when PRM scores are used as explicit reward signals during RL training (Michaud et al., 2020; Tien et al., 2022). Even works that co-train the policy and PRMs online (Zha et al., 2025; Cui et al., 2025) can only train PRMs in implicit ways that lack accurate process scores, such as implicit generative rewards or alignment between process rewards and outcomes. Therefore, instead of training another PRM for a specific dataset or base model, we focus on how to robustly integrate a pre-trained PRM into online training, i.e., how to harmonize accurate but coarse-grained ORMs with fine-grained but noisy PRMs in RL.
We develop a PRocess cOnsistency Filter (PROF) framework, an online data curation strategy based on process-outcome consistency. PROF oversamples responses at training time and then ranks and filters them by PRM–ORM consistency. Specifically, it removes samples where the process and outcome signals conflict, such as correct responses derived from flawed reasoning or incorrect responses that contain sound reasoning steps. By using PRMs for filtering rather than as direct optimization targets, PROF injects process supervision into RLVR while avoiding the instability of explicit PRM reward maximization. Furthermore, because correct and incorrect responses have different consistency distributions, we rank each group separately to maintain a balanced training ratio. PROF is a modular framework that can be combined with RL algorithms like Group Relative Policy Optimization (GRPO) for online training.
We conduct extensive experiments to validate the improvement of PROF on both outcome accuracy and reasoning quality using both Qwen (Yang et al., 2024) and LLaMA (Dubey et al., 2024) models. To summarize, we highlight our key contributions as follows:
- We identify a fundamental reasoning-quality gap in current RLVR. Because outcome-only rewards can reward spurious successes, current RLVR can improve final-answer accuracy without reliably improving faithful reasoning, a failure mode we characterize as outcome reward hacking. We support this process-outcome mismatch with both theoretical analysis and empirical evidence.
- We propose PROF, a consistency-based data curation framework that robustly injects PRM supervision into RLVR. Rather than directly optimizing PRM scores or naively blending PRM and ORM rewards, PROF uses PRM–ORM consistency for ranking and filtering, allowing it to remove conflicting trajectories while maintaining a balanced correct/incorrect training ratio.
- Extensive experiments and ablations on both Qwen and LLaMA models show that PROF consistently improves both final-answer accuracy and intermediate reasoning quality over strong baselines, with less dependence on strong PRMs. Under matched compute cost and matched rollout-group size, PROF still achieves larger gains by almost . We further demonstrate robustness to different off-the-shelf PRMs, generality beyond GRPO, and the importance of filtering correct and incorrect responses separately.
2 Related Work
Reasoning-Quality Gaps and Faithfulness of Chain-of-Thought.
A growing literature documents process-outcome mismatch in language models: final-answer correctness can diverge substantially from reasoning quality, especially from the faithfulness of a model’s verbalized reasoning. Turpin et al. (2023) show that CoT explanations can omit biasing features and rationalize incorrect predictions, while Lyu et al. (2023) argue that standard CoT does not guarantee a faithful explanation of how the answer is produced. Subsequent work measures this reasoning-quality gap more directly: Nguyen et al. (2024) report a significant disparity between answer accuracy and CoT faithfulness in multi-hop question answering, and Paul et al. (2024) use causal mediation analysis to show that LLMs do not reliably use their generated intermediate steps when producing the final answer. Beyond faithfulness, Yeo et al. (2024) advocate evaluating reasoning explanations along multiple axes including robustness and utility, and Jacovi et al. (2024) show that even dedicated verifiers struggle to detect logical errors and contradictions inside reasoning chains. Recent monitoring work extends these concerns to reasoning models themselves: Baker et al. (2025); Chen et al. (2025b) show that model-generated reasoning often fails to transparently reveal the cues or considerations that drive behavior. Relative to this line of work, we focus on the RLVR setting and study how process supervision can reduce process-outcome mismatch during online training by filtering trajectories whose final outcomes and reasoning quality are inconsistent.
Process-Supervised Reward Models for Fine-Grained Feedback.
RLHF focuses on trajectory-level comparison under the Bradley-Terry model. For reasoning-related tasks, Yang et al. (2024) use final-answer correctness to construct preference pairs and train Bradley-Terry reward models for mathematical reasoning. A more widely used approach, termed Outcome Reward Models (ORMs), trains a classifier to predict whether the final answer is correct based on the reasoning history. However, Lightman et al. (2023) show that Process-Supervised Reward Models (PRMs), which evaluate each intermediate step of a reasoning chain, significantly outperform ORMs, especially for data selection tasks such as best-of-n sampling. Their approach, however, requires human annotators to label each intermediate step. Wang et al. (2023) propose using Monte-Carlo estimation of the Q value to determine labels automatically. Many follow-up works improve PRMs through generative reward modeling, advanced training techniques such as RL, and refined engineering practices (Xiong et al., 2024b; Zhang et al., 2025; Khalifa et al., 2025; Zhao et al., 2025; Xiong et al., 2025c). Our work does not focus on improving PRMs themselves; instead, we use PRMs to supervise the intermediate steps of CoT trajectories for data filtering. We mainly use Qwen2.5-Math-PRM-7B from Zhang et al. (2025) because it is trained on the Qwen distribution and achieves strong performance on ProcessBench (Zheng et al., 2024).
Sample Filtering in Reinforcement Learning for LLMs.
A key challenge in applying reinforcement learning to LLM applications is the imperfection of reward signals. These signals either stem from a learned reward model, as in Reinforcement Learning from Human Feedback (RLHF), or are sparse and delivered only at the end of a trajectory, as in RLVR. In RLHF, the reward model is trained on human-annotated pairwise comparisons, typically using a Bradley-Terry model (Bradley and Terry, 1952). Due to inherent human disagreement and finite training data, the reward model develops shortcuts that RL algorithms can exploit (Lin et al., 2023; Eisenstein et al., 2023) to obtain spuriously high rewards. Consequently, these rewards may not fully align with the underlying intended goals, leading to reward hacking.
Data filtering has proven effective in mitigating reward hacking across RL-based LLM training. In RLHF, prior work filters preference pairs by reward gap (Yuan et al., 2024; Dong et al., 2024; Xiong et al., 2024a; Zhang et al., 2024) or combines reward with response length (Kim et al., 2024; Yu et al., 2025a) to retain samples that are more reliable under the learned reward model.
Filtering is also useful in RLVR despite the reward being available only at the final outcome. Rejection sampling fine-tuning discards incorrect trajectories and often approaches stronger RL baselines (Dong et al., 2023; Chen et al., 2025a; Xiong et al., 2025a). Other methods filter prompts by difficulty (Yang et al., 2024), remove zero-gradient prompts via dynamic sampling (Yu et al., 2025b), or over-sample and retain subsets that improve reward variance or the balance between correct and incorrect responses (Xiong et al., 2025a; Xu et al., 2025). In contrast to these methods, which mainly rely on coarse outcome-level signals, our approach uses process-supervised reward models (PRMs) (Lightman et al., 2023) to filter trajectories based on reasoning quality at the level of intermediate steps and their consistency with ORMs.
3 Formulation and Algorithm
An LLM defines a policy distribution: given a prompt $x$, it assigns density $\pi_\theta(a \mid x)$ to each response $a$. For mathematical reasoning tasks with a binary verifiable reward, there exists a verifier mapping prompt-response pairs $(x, a)$ to a scalar reward $r(x, a) \in \{0, 1\}$. For each prompt, we generate a group of $n$ responses together with their verifier outcomes, denoted by $\{(a_i, r_i)\}_{i=1}^{n}$.
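Algorithm 1 (PROF) computes, as Equation (1), a trajectory-wise consistency score for each rollout (the mean of its step-level PRM rewards plus a step-length regularization), and then filters the incorrect group in one of two ways:
- PROF-POS: randomly pick the samples to keep from the incorrect group;
- PROF-BOTH: keep the incorrect samples with the lowest consistency scores.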
GRPO.
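Shao et al. (2024) propose this policy gradient algorithm, which simplifies Proximal Policy Optimization (PPO) (Schulman et al., 2017) by computing the advantage only from the outcome rewards within a group. Instead of maintaining and updating a separate value network, GRPO computes the advantage by standardizing the outcome rewards within a group of $n$ responses: for $i \in \{1, \dots, n\}$,
$$A_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_n)}{\mathrm{std}(r_1, \dots, r_n) + \delta},$$
where $r_i$ is the outcome reward for a given response $a_i$ and $\delta$ is a small constant for numerical stability. Let $a_{i,t}$ denote the $t$-th token of $a_i$ and $\rho_{i,t}(\theta)$ denote the importance ratio $\pi_\theta(a_{i,t} \mid x, a_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid x, a_{i,<t})$ for prompt $x$. This advantage is then incorporated into a clipped surrogate objective, which is optimized to update the policy from $\pi_{\theta_{\mathrm{old}}}$ to $\pi_\theta$:
$$J(\theta) = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|a_i|} \sum_{t=1}^{|a_i|} \min\left( \rho_{i,t}(\theta)\, A_i,\ \mathrm{clip}\left(\rho_{i,t}(\theta),\, 1 - \epsilon_{\mathrm{clip}},\, 1 + \epsilon_{\mathrm{clip}}\right) A_i \right) \right].$$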
Although this approach stabilizes the online policy optimization and is efficient, the sparse reward signal limits further improvement in intermediate reasoning quality.
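The group-standardized advantage and the clipped per-token term can be sketched in a few lines. This is a minimal illustration rather than the training code: it assumes the population standard deviation (implementations may use the sample standard deviation instead), and the names `delta` and `clip_eps` as well as their default values are placeholders.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], delta: float = 1e-6) -> List[float]:
    """Standardize outcome rewards within one rollout group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + delta) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token clipped surrogate: min(ratio * A, clip(ratio) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note that a group in which every response receives the same outcome reward yields zero advantages for all responses, so such a prompt contributes no gradient; this is one reason the balance between correct and incorrect responses matters.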
Process Reward Model (PRM).
For a response $a = (a^1, \dots, a^H)$ composed of $H$ reasoning steps, we follow previous works (Zheng et al., 2024; Zhang et al., 2025; Zou et al., 2025) and use a newline as the delimiter between steps. For each step $a^h$, the PRM maps it, together with the previous steps and the prompt, to a scalar score $r^h = \mathrm{PRM}(x, a^{1:h})$, where we use the short-hand notation $a^{1:h} = (a^1, \dots, a^h)$.
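The step segmentation and step-level scoring described above can be sketched as follows. The `prm` argument is a hypothetical callable standing in for a real process reward model, and skipping blank lines is an assumption of this sketch rather than a documented detail.

```python
from typing import Callable, List


def split_steps(response: str) -> List[str]:
    """Split a response into reasoning steps, using newlines as delimiters."""
    # Assumption: blank lines are not counted as steps.
    return [line.strip() for line in response.split("\n") if line.strip()]


def step_scores(prompt: str, response: str,
                prm: Callable[[str, List[str]], float]) -> List[float]:
    """Score each step given the prompt and all steps up to and including it.

    `prm(prompt, prefix_steps)` is a hypothetical interface that returns the
    score of the last step in `prefix_steps`.
    """
    steps = split_steps(response)
    return [prm(prompt, steps[: h + 1]) for h in range(len(steps))]
```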
Our Method PROF: Process Consistency Filter Framework
We propose PROF in Algorithm 1 to robustly incorporate PRM–ORM consistency after the rollout phase, and also visualize it in Figure 2. First, we generate $n$ samples per prompt and obtain their outcome rewards $r_i$. Then, we call the PRM to generate step-level rewards for each rollout and compute the trajectory-wise consistency score $r^{\mathrm{pro}}_i$ by taking the mean over step-level rewards and adding a step-length regularization in Equation 1, where $\lambda$ is the regularization parameter and $\bar{H}$ is the threshold for the penalized step number. This regularization ensures that samples with no step segments or over-long steps are discarded in the correct group. The samples are divided into two subgroups: $G_+$ contains correct samples with $r_i = 1$, and $G_-$ contains incorrect samples with $r_i = 0$. Inspired by Xu et al. (2025), the numbers to keep in each subgroup, $m_+$ and $m_-$, are chosen to maximize the outcome-reward variance of the final $m = m_+ + m_-$ kept samples, which equals $m_+ m_- / m^2$. Since $m$ is fixed, $m_+ m_-$ should be maximized, and the maximum is attained when $m_+$ is closest to $m/2$ under the constraints $m_+ \le |G_+|$ and $m_- \le |G_-|$. This implies that the ratio of correct and incorrect responses should be balanced. After that, for the correct group, we use $r^{\mathrm{pro}}_i$ to rank and keep the top $m_+$ samples. For the incorrect group, PROF-POS randomly keeps $m_-$ samples, while PROF-BOTH uses $r^{\mathrm{pro}}_i$ to rank and keep the bottom $m_-$ samples. Finally, we collect the kept trajectories for the policy update.
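The filtering procedure described above can be sketched as follows. This is a hedged illustration, not the released implementation: the exact form of the step-length regularization is not recoverable from the text, so `consistency_score` uses an assumed flat penalty, and the names `lam`, `max_steps`, `balanced_counts`, and `prof_filter` together with all default values are placeholders.

```python
import random
from typing import List, Optional, Sequence, Tuple


def consistency_score(step_rewards: Sequence[float], lam: float = 1.0,
                      max_steps: int = 30) -> float:
    """Mean step reward minus an ASSUMED flat penalty for degenerate lengths."""
    h = len(step_rewards)
    if h == 0:  # no step segments at all
        return -lam
    penalty = lam if h > max_steps else 0.0  # over-long responses
    return sum(step_rewards) / h - penalty


def balanced_counts(num_pos: int, num_neg: int, m: int) -> Tuple[int, int]:
    """Choose (m_plus, m_minus) with m_plus as close to m / 2 as feasible."""
    m = min(m, num_pos + num_neg)
    lo, hi = max(0, m - num_neg), min(num_pos, m)
    m_plus = min(max(m // 2, lo), hi)
    return m_plus, m - m_plus


def prof_filter(rewards: Sequence[int], scores: Sequence[float], m: int,
                both: bool = True,
                rng: Optional[random.Random] = None) -> List[int]:
    """Return indices of the rollouts kept for the policy update."""
    rng = rng or random.Random(0)
    pos = [i for i, r in enumerate(rewards) if r == 1]
    neg = [i for i, r in enumerate(rewards) if r != 1]
    m_plus, m_minus = balanced_counts(len(pos), len(neg), m)
    # Correct group: keep the responses with the strongest process support.
    keep_pos = sorted(pos, key=lambda i: scores[i], reverse=True)[:m_plus]
    if both:
        # PROF-BOTH: keep the incorrect responses with the weakest support.
        keep_neg = sorted(neg, key=lambda i: scores[i])[:m_minus]
    else:
        # PROF-POS: keep a random subset of the incorrect responses.
        keep_neg = rng.sample(neg, m_minus)
    return sorted(keep_pos + keep_neg)
```

In this sketch, calling `prof_filter(rewards, scores, m, both=False)` corresponds to PROF-POS, while the default `both=True` corresponds to PROF-BOTH; ranking is always done within each group, never across groups.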
False Positives Are Frequent and Filterable.
We provide empirical evidence in Figure 1 to justify the practical motivation. On 2k samples from Qwen2.5-Math-7B, we find that of correct responses still exhibit flawed reasoning, as judged by Claude. Crucially, within this flawed-correct subset, when PROF filters the bottom half of correct responses by PRM consistency, it identifies and removes of these flawed responses. This confirms that process-outcome mismatch is a critical bottleneck and that PROF effectively filters problematic samples to improve gradient quality.
4 Experiments
4.1 Setup
We focus on mathematical reasoning tasks in this work. For online training, we use the Numina-Math prompt set (Beeching et al., 2024), which contains nearly 860k math problems with ground-truth answers ranging from Chinese high school exercises to US and international mathematics olympiad problems. We use Qwen2.5-Math-1.5B-base and Qwen2.5-Math-7B-base (Yang et al., 2024) as the training base models. For the PRM, we mainly use Qwen2.5-Math-PRM-7B (Zhang et al., 2025) to generate process rewards. We also experiment with a weaker PRM, Skywork-PRM-1.5B (He et al., 2024b), to study the robustness of PROF to PRM quality. More details are provided in Appendix A. Model performance is evaluated on five benchmarks: Math500 (Hendrycks et al., 2021), Minerva Math (Lewkowycz et al., 2022), Olympiad Bench (He et al., 2024a), AMC2023 (https://huggingface.co/datasets/math-ai/amc23), and AIME2024 (https://huggingface.co/datasets/math-ai/aime24). We mainly use average@ for evaluation, i.e., accuracy averaged over responses per prompt under temperature . The models are allowed to generate tokens.
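The averaged-accuracy metric, i.e., accuracy averaged over several sampled responses per prompt, can be sketched as follows. The nested-list input layout and the function name are illustrative assumptions.

```python
from typing import List, Sequence


def average_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Average accuracy over prompts, each with k sampled responses.

    correct[p][j] is True if the j-th sampled response to prompt p is correct.
    """
    per_prompt: List[float] = [sum(c) / len(c) for c in correct]
    return sum(per_prompt) / len(per_prompt)
```

Compared with a single greedy decode, averaging over several samples reduces the variance of the estimate, which matters on small benchmarks.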
4.2 Main Results
| Model | Algorithm | Math500 | Minerva Math | Olympiad Bench | AIME24 | AMC23 | Average |
| Qwen2.5-Math-1.5B-base | Base | 39.9 | 11.4 | 19.1 | 3.5 | 23.6 | 19.5 |
| Qwen2.5-Math-1.5B-base | GRPO | 70.3 | 29.1 | 33.0 | 9.0 | 44.5 | 37.2 |
| Qwen2.5-Math-1.5B-base | Blend | 67.6 | 27.8 | 31.1 | 7.7 | 42.5 | 35.3 |
| Qwen2.5-Math-1.5B-base | PROF-POS | 72.6 | 31.3 | 36.1 | 10.6 | 50.3 | 40.2 |
| Qwen2.5-Math-1.5B-base | PROF-BOTH | 73.2 | 30.0 | 36.1 | 9.6 | 49.1 | 39.6 |
| Qwen2.5-Math-7B-base | Base | 42.0 | 12.8 | 19.2 | 12.9 | 30.0 | 23.4 |
| Qwen2.5-Math-7B-base | GRPO | 81.6 | 37.2 | 45.5 | 20.6 | 64.4 | 49.9 |
| Qwen2.5-Math-7B-base | Blend | 81.7 | 36.7 | 45.0 | 15.2 | 58.0 | 47.3 |
| Qwen2.5-Math-7B-base | PROF-POS | 81.4 | 36.6 | 45.0 | 24.8 | 64.2 | 50.6 |
| Qwen2.5-Math-7B-base | PROF-BOTH | 83.1 | 39.0 | 47.8 | 17.5 | 70.9 | 51.7 |
| LLaMA-3.2-3B-instruct | Base | 30.0 | 8.8 | 6.1 | 2.3 | 10.6 | 11.6 |
| LLaMA-3.2-3B-instruct | GRPO | 50.5 | 18.8 | 17.9 | 5.0 | 25.6 | 23.6 |
| LLaMA-3.2-3B-instruct | Blend | 37.2 | 13.1 | 9.9 | 1.0 | 17.2 | 15.7 |
| LLaMA-3.2-3B-instruct | PROF-POS | 52.4 | 19.5 | 19.8 | 6.7 | 28.6 | 25.4 |
| LLaMA-3.2-3B-instruct | PROF-BOTH | 49.0 | 18.0 | 17.3 | 5.4 | 23.9 | 22.7 |
We summarize our main results in Table 2, where Blend denotes a common approach that mixes PRM scores with outcome rewards (Zha et al., 2025; Cui et al., 2025; Zou et al., 2025). Following Zou et al. (2025), the PRM scores are averaged over steps for each response, weighted by a parameter , and added to outcome rewards. We use parameter according to Table 5 of Zou et al. (2025). Our main findings are as follows.
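For contrast with filtering, the Blend baseline described above can be sketched as a single reward computation. The weight name `beta` and the handling of empty step lists are assumptions of this sketch; the weight value used in the experiments is not reproduced here.

```python
from typing import Sequence


def blend_reward(outcome: float, step_rewards: Sequence[float], beta: float) -> float:
    """Outcome reward plus a weighted mean of step-level PRM scores."""
    # Assumption: a response without any scored step gets no PRM bonus.
    mean_prm = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return outcome + beta * mean_prm
```

Because the averaged PRM score enters the optimized reward directly, the policy can raise it by changing how steps are written rather than by reasoning better, which is the failure mode that filtering avoids.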
PROF Improves Accuracy under Standard and Matched-Cost Comparisons.
As shown in Table 2, our proposed methods, PROF-POS and PROF-BOTH, outperform GRPO and Blend-PRM-GRPO in average accuracy on both Qwen base models. For models starting from Qwen2.5-Math-1.5B-base, PROF-POS and PROF-BOTH achieve average accuracies of 40.2 and 39.6, surpassing the standard GRPO baseline (37.2) and Blend-PRM-GRPO (35.3). A similar trend is observed with Qwen2.5-Math-7B-base, where PROF-POS and PROF-BOTH achieve 50.6 and 51.7 average accuracies, above GRPO’s 49.9 and Blend-PRM-GRPO’s 47.3. Moreover, for LLaMA-3.2-3B-instruct, whose policy distribution differs from the Qwen family, Blend performs even worse than GRPO, while PROF-POS still outperforms the GRPO baseline by 1.8 points (25.4 vs. 23.6). The learning dynamics in Figure 4 corroborate these findings, illustrating that PROF maintains a consistent performance advantage over both GRPO and Blend-PRM-GRPO throughout training, with faster convergence and higher final accuracy than GRPO.
To further address efficiency and fairness concerns, we increase the rollout group size and policy update group size , and compare GRPO- with PROF- on Qwen2.5-Math-7B-base under matched compute cost. As shown in Figure 3, PROF achieves larger gains than GRPO at the same cost level. We compute average cost as Inference Train PRM, where the factor is a rough FLOPs proxy for training relative to a forward pass. We further aggregate all five benchmarks by base-model pass rate into four difficulty levels: Level 1 (), Level 2 (), Level 3 (), and Level 4 (). PROF’s gain is especially pronounced on harder problems, plausibly because easier problems usually involve shorter and simpler reasoning with fewer flaws, making improvements smaller and more sensitive to PRM noise, whereas harder problems rely much more on PRM’s ability to distinguish trajectory quality. Due to space constraints, Figure 3 visualizes only Level 4 in the main text, while matched-cost curves for Levels 1–3 are provided in Appendix Figure 8.
Filtration Method is Much More Robust than Blending.
We plot the entropy loss and response length curves of GRPO, Blend-PRM-GRPO, and PROF in Figure 7. Blend-PRM-GRPO shows clear signs of severe reward hacking: its entropy collapses quickly toward zero while its response length (right plot) increases uncontrollably, indicating that the model has learned to game the PRM by over-generating verbose and repetitive steps to obtain a higher averaged process reward. As a result, Blend-PRM-GRPO’s test accuracy even falls below that of GRPO. In contrast, PROF shows a gradual, slightly faster decrease in entropy loss together with controlled response-length growth. This illustrates that our filtration method effectively leverages the PRM signal while staying robust to reward hacking. Below, we further analyze the quality of intermediate reasoning steps.
4.3 PROF Improves Reasoning Process Quality
PROF Improves Reasoning Consistency.
To evaluate the quality of intermediate steps, we adopt Monte Carlo (MC) estimation, a common way to estimate the probability of reaching correct final answers (Wang et al., 2023; Xiong et al., 2024a; Luo et al., 2024). For this analysis, we select problem-response pairs from test prompts where our method and GRPO both produced the correct final answer. Both models were initialized from Qwen2.5-Math-7B-base. To estimate the value of each reasoning step, we generate eight independent completions from that point using a temperature of 1.0, and the resulting empirical success rate serves as the MC value. In Figure 5 (left), the average MC estimates across all five benchmarks are consistently higher for our model. The specific improvement gaps are on Math500, on Minerva Math, on Olympiad Bench, on AMC2023, and on AIME2024, which are much larger than the outcome-accuracy gap in Table 2.
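The Monte Carlo estimate of step values can be sketched as follows. The `rollout` argument is a hypothetical callable that samples one completion from a given prefix and reports whether it reaches the correct final answer; a real evaluation would back it with the policy model and the answer verifier.

```python
from typing import Callable, List, Sequence


def mc_step_values(prompt: str, steps: Sequence[str],
                   rollout: Callable[[str, List[str]], bool],
                   k: int = 8) -> List[float]:
    """Estimate, for each step prefix, the probability of a correct final answer.

    For every prefix steps[:h], sample k completions and record the
    empirical success rate as the MC value of step h.
    """
    values = []
    for h in range(1, len(steps) + 1):
        prefix = list(steps[:h])
        wins = sum(1 for _ in range(k) if rollout(prompt, prefix))
        values.append(wins / k)
    return values
```

A response whose final answer is correct but whose intermediate steps have low MC values is a candidate for the flawed-but-correct pattern discussed earlier, since continuing from those steps rarely recovers the right answer.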
PROF Reduces Flawed Reasoning within Correct Responses.
As a more direct faithfulness metric, we audit correct responses on the test set with Claude Sonnet 4.6 and ask whether the reasoning process contains any flaw (e.g., logical or arithmetic errors), even when the final answer is correct. The audit prompt is provided in Appendix A.2. In Figure 5 (right), the flawed-reasoning rate within correct responses decreases from for GRPO to for PROF. This complements Figure 1: Figure 1 measures flawed-reasoning prevalence in base-model outputs before RL (about , specifically ), whereas Figure 5 reports the same notion after training, where both methods fall below and PROF remains lower. We also note that Claude-based auditing is still an approximate signal of reasoning quality and cannot fully replace careful human judgment on step granularity, subtle unsupported jumps, or the level of detail. Therefore, we additionally provide qualitative response comparisons in Figures 12 and 13. These examples consistently show that PROF produces concrete and verifiable intermediate deductions, GRPO tends to skip key steps, and Blend-PRM-GRPO is often verbose but less reliable in core calculations.
Additional process metrics on Math500 (step counts and averaged PRM scores) are reported in Appendix Figure 10. The key takeaway is that PROF improves process quality under both MC-based estimation and direct flaw auditing.
5 Ablations
5.1 Robustness to PRM Capability
| Algorithm | Math500 | Minerva Math | Olympiad Bench | AIME24 | AMC23 | Average |
| GRPO | 81.6 | 37.2 | 45.5 | 20.6 | 64.4 | 49.9 |
| Blend (PRM-7B) | 81.7 | 36.7 | 45.0 | 15.2 | 58.0 | 47.3 |
| PROF (PRM-7B) | 83.1 | 39.0 | 47.8 | 17.5 | 70.9 | 51.7 |
| Blend (PRM-1.5B) | 81.1 | 37.8 | 44.1 | 11.7 | 62.8 | 47.5 |
| PROF-POS (PRM-1.5B) | 82.9 | 39.4 | 47.4 | 19.2 | 66.1 | 51.0 |
| PROF-BOTH (PRM-1.5B) | 83.2 | 38.8 | 47.8 | 17.5 | 65.0 | 50.5 |
To showcase PROF’s robustness to PRM quality, we use a weaker and smaller Skywork-PRM-1.5B (He et al., 2024b) while training from Qwen2.5-Math-7B-base. The results in Table 3 validate that when using a weaker PRM, Blend achieves lower accuracies, while PROF still maintains performance close to the model trained with the 7B PRM. This finding further corroborates the robustness of our algorithm.
5.2 Generality beyond GRPO: RAFT++
To demonstrate that PROF is a general filtration framework, we extend our experiments to RAFT++ (Xiong et al., 2025a), a rejection-sampling-based online training paradigm that only trains on positive samples. We compare PROF-Raft++ against standard RAFT++ baselines with different rollout budgets in Table 4. PROF-Raft++ not only outperforms the standard Raft++- baseline, but also significantly surpasses Raft++-. Since RAFT++ only uses positive samples and does not involve negative samples, this comparison is primarily influenced by the number and quality of positive trajectories. Therefore, PROF’s priority-based filtration is algorithm-agnostic and consistently identifies high-quality reasoning paths that lead to better policy improvement, regardless of the underlying RL objective.
| Method | Average score |
| Raft++- | 35.27 |
| Raft++- | 37.75 |
| PROF-Raft++ () | 39.29 |
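Because RAFT++ trains only on positive samples, combining it with PROF plausibly reduces to ranking the correct responses by their consistency scores. The sketch below illustrates that reading; it is an assumption about how the combination works, and the function name and arguments are placeholders.

```python
from typing import List, Sequence


def prof_raft_select(rewards: Sequence[int], scores: Sequence[float], m: int) -> List[int]:
    """Keep the m correct rollouts with the highest consistency scores."""
    positives = [i for i, r in enumerate(rewards) if r == 1]
    ranked = sorted(positives, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:m])
```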
5.3 Separating Correct and Incorrect
We first test a no-separation variant (PROF w/o separation) that ranks all rollouts together. To mitigate PRM scale bias, we center each step score by subtracting the batch mean. Even with centering, the rightmost plot in Figure 6 shows that PROF w/o separation has over gap between rewards before and after filtering, indicating disproportionate removal of negative samples. A likely reason is that incorrect responses often contain several locally correct steps, which can inflate averaged PRM scores and blur process-outcome consistency. Separating correct and incorrect groups alleviates this bias.
We then compare three variants: PROF-POS (consistency filtering on correct group only), PROF-NEG (incorrect group only), and Filter-Random (random filtering on both groups) (Xu et al., 2025). As shown in Figure 6, PROF-POS and PROF-BOTH are the best-performing strategies across both 1.5B and 7B settings; PROF-BOTH is typically more sample-efficient, PROF-NEG is weaker, Filter-Random is only slightly above GRPO, and w/o separation is the worst. These results suggest that preserving quality in correct responses is the dominant factor, while consistency control on incorrect responses is secondary. More filtration ablations are provided in Appendix B.
This ablation highlights a practical trade-off between PROF-BOTH and PROF-POS. PROF-BOTH usually converges faster by using consistency signals from both groups, while PROF-POS can be more robust when PRM reliability is weaker or distribution shift is larger, since it avoids tightly shaping the incorrect group with noisy estimates. In both cases, improving correct trajectories is the main driver, and filtering incorrect trajectories mainly affects efficiency and stability.
6 Conclusion and Future Work
This work introduces Process Consistency Filter (PROF), a data curation technique that filters generated responses based on PRM–ORM consistency while maintaining a balanced correct/incorrect ratio. We demonstrate that PROF consistently improves final-answer accuracy and shapes the policy to generate more detailed and fine-grained intermediate reasoning steps. PROF is also a general filtration framework rather than one tied to a specific PRM or RL objective. Thus, using pre-trained PRMs in our experiments is not a limitation; instead, it highlights the robustness of our algorithm to different PRMs and suggests that training a task-specific PRM for each base model is unnecessary. Exploring stronger or more diverse PRMs, and extending PROF to other reasoning tasks such as coding (Jimenez et al., 2023) and web navigation (Zhou et al., 2023), remains important future work.
Broader Impact and Ethics Statement
Our work contributes to AI safety by enhancing the faithfulness and interpretability of chain-of-thought reasoning, mitigating the risk of misleading hallucinations. However, we acknowledge two potential risks. First, the reliance on oversampling and dense process reward computation increases computational overhead and environmental impact compared to standard baselines. Second, our filtration mechanism depends on pre-trained Process Reward Models (PRMs); if these PRMs harbor biases toward specific reasoning patterns or languages, our method may inadvertently amplify such biases by filtering out diverse but valid solutions. We encourage future research to address these efficiency and fairness challenges.
Limitations
Although PROF can effectively improve robustness to PRM noise and increase reasoning-step quality, our method requires more computation than Blend or vanilla GRPO because it first oversamples and then filters. How to balance efficiency and reasoning quality remains an important direction for future work. Finally, we acknowledge the use of AI assistants (e.g., ChatGPT) for grammatical error correction and polishing of the manuscript.
References
- Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.
- NuminaMath 7B CoT. Numina Hugging Face. https://huggingface.co/AI-MO/NuminaMath-7B-CoT
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39 (3/4), pp. 324–345.
- Bridging supervised learning and reinforcement learning in math reasoning. arXiv preprint arXiv:2505.18116.
- Reasoning models don’t always say what they think. arXiv preprint arXiv:2505.05410.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456.
- RAFT: reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767.
- RLHF workflow: from reward modeling to online RLHF. arXiv preprint arXiv:2405.07863.
- The Llama 3 herd of models. arXiv e-prints, pp. arXiv–2407.
- Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint arXiv:2312.09244.
- OlympiadBench: a challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.
- Skywork-o1 open series. Zenodo.
- Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- A chain-of-thought is as strong as its weakest link: a benchmark for verifiers of reasoning chains. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- OpenAI o1 system card. arXiv preprint arXiv:2412.16720.
- SWE-bench: can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
- Process reward models that think. arXiv preprint arXiv:2504.16828.
- "I’m not sure, but…": examining the impact of large language models’ uncertainty expression on user reliance and trust. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 822–835.
- Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems 35, pp. 3843–3857.
- Let’s verify step by step. In The Twelfth International Conference on Learning Representations, Cited by: §1, §2, §2.
- Mitigating the alignment tax of rlhf. arXiv preprint arXiv:2309.06256. Cited by: §2.
- Improve mathematical reasoning in language models by automated process supervision. External Links: 2406.06592, Link Cited by: §4.3.
- Faithful chain-of-thought reasoning. In The 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2023), Cited by: §1, §2.
- Understanding learned reward functions. arXiv preprint arXiv:2012.05862. Cited by: §1.
- Direct evaluation of chain-of-thought in multi-hop reasoning with knowledge graphs. In Findings of the Association for Computational Linguistics: ACL 2024, Cited by: §2.
- Making reasoning matter: measuring and improving faithfulness of chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, Cited by: §2.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3.
- Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1, §3.
- Hybridflow: a flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp. 1279–1297. Cited by: §A.1.
- Causal confusion and reward misidentification in preference-based reward learning. arXiv preprint arXiv:2204.06601. Cited by: §1.
- Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Cited by: §2.
- Math-shepherd: verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935. Cited by: §2, §4.3.
- Building math agents with multi-turn iterative preference learning. arXiv preprint arXiv:2409.02392. Cited by: §2, §4.3.
- A minimalist approach to llm reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343. Cited by: §2, §5.2.
- An implementation of generative prm. Cited by: §2.
- Self-rewarding correction for mathematical reasoning. arXiv preprint arXiv:2502.19613. Cited by: §1.
- StepWiser: stepwise generative judges for wiser reasoning. External Links: 2508.19229, Link Cited by: §2.
- Not all rollouts are useful: down-sampling rollouts in llm reinforcement learning. arXiv preprint arXiv:2504.13818. Cited by: §2, §3, §5.3.
- Qwen2.5-math technical report: toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122. Cited by: §1, §2, §2, §4.1.
- How interpretable are reasoning explanations from prompting large language models?. In Findings of the Association for Computational Linguistics: NAACL 2024, Cited by: §1, §2.
- Rip: better models by survival of the fittest prompts. arXiv preprint arXiv:2501.18578. Cited by: §2.
- Dapo: an open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476. Cited by: §A.1, §2.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020. Cited by: §2.
- RL tango: reinforcing generator and verifier together for language reasoning. arXiv preprint arXiv:2505.15034. Cited by: §1, §4.2.
- Policy filtration in rlhf to fine-tune llm for code generation. Cited by: §2.
- The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301. Cited by: §1, §2, §3, §4.1.
- GenPRM: scaling test-time compute of process reward models via generative reasoning. External Links: 2504.00891, Link Cited by: §2.
- Processbench: identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559. Cited by: §2, §3.
- Webarena: a realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854. Cited by: §6.
- Chain-of-thought matters: improving long-context language models with reasoning path supervision. arXiv preprint arXiv:2502.20790. Cited by: §1.
- ReasonFlux-prm: trajectory-aware prms for long chain-of-thought reasoning in llms. arXiv preprint arXiv:2506.18896. Cited by: §1, §3, §4.2.
Appendix A Additional Experimental Details and Results
A.1 Main Experiments
The implementations are based on the verl framework (Sheng et al., 2025), and we follow most of its parameter settings. Specifically, we use the AdamW optimizer with learning rate . We adopt the clip-higher trick (Yu et al., 2025b), which clips the sampling ratio to an asymmetric range . Specifically, we set for models initialized from Qwen2.5-Math-1.5B-base and maintain for other cases. In each iteration, we sample prompts and roll out responses per prompt for GRPO and responses for PROF. Note that the policy update number for all algorithms is . For the regularization of step numbers in Algorithm 1, we take and . For the rollout stage, we use a temperature of and a top-p value of . We set the KL loss coefficient to and entropy loss coefficient to . All the models are trained with H100 GPUs. We set the training mini-batch size as and allow the models to generate tokens per prompt.
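To make the asymmetric clipping concrete, the following is a minimal sketch of a per-token clipped surrogate with separate lower and upper widths. It is not the verl implementation; the function name `clip_higher_surrogate` and the widths `EPS_LOW` and `EPS_HIGH` are hypothetical and chosen only for illustration, not the values used in our experiments.

```python
def clip_higher_surrogate(ratio, advantage, eps_low, eps_high):
    """Per-token clipped surrogate with an asymmetric clipping range.

    ratio: sampling ratio pi_new / pi_old for one token.
    advantage: advantage estimate for that token.
    eps_low, eps_high: lower and upper clipping widths (illustrative values only).
    """
    # Clip the ratio to the asymmetric range [1 - eps_low, 1 + eps_high].
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic objective: take the smaller of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


# Hypothetical clipping widths, chosen only for illustration.
EPS_LOW, EPS_HIGH = 0.2, 0.3

# Positive advantage, ratio 1.25: inside the widened upper bound, so not clipped.
unclipped = clip_higher_surrogate(1.25, 1.0, EPS_LOW, EPS_HIGH)
# Positive advantage, ratio 1.5: capped at 1 + EPS_HIGH.
capped = clip_higher_surrogate(1.5, 1.0, EPS_LOW, EPS_HIGH)
# Negative advantage, ratio 0.5: the lower bound 1 - EPS_LOW applies.
floored = clip_higher_surrogate(0.5, -1.0, EPS_LOW, EPS_HIGH)
```

With a symmetric width of 0.2, the first token would already be clipped at 1.2. The wider upper bound leaves more room to increase the probability of positively rewarded tokens.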
A.2 Prompt Template
We present the template used by the LLM to audit whether a correct response still contains reasoning flaws in Table 5.
Appendix B Additional Experimental Results
In this section, we include additional ablation studies and evaluation results for a more comprehensive understanding of the PROF framework.
B.1 Matched-Cost Results for Difficulty Levels 1–3
B.2 Effect of Rollout Numbers
We study the scale of rollout numbers with fixed policy-update number by varying . The lower-right plot in Figure 9 presents the test accuracy averaged over all five benchmarks for PROF-BOTH (Both) and PROF-POS (Correct) initialized from Qwen2.5-Math-7B-base. We observe that performance first increases and then decreases as grows, revealing a trade-off between enhancing process reasoning quality and avoiding reward hacking. Notably, PROF-POS decreases later (after ) because it only leverages PRM influence in the correct group, indicating that PROF-POS is more robust when PRM influence becomes stronger, such as when the ranking-and-filtering scale increases.
B.3 Additional Process Metrics on Math500
B.4 Variants of Filtration Methods
| Algorithm | Math500 | Minerva Math | Olympiad Bench | AIME24 | AMC23 | Average |
| Mean | 83.1 | 39.0 | 47.8 | 17.5 | 70.9 | 51.7 |
| Minimum | 82.9 | 38.3 | 46.7 | 20.8 | 65.9 | 50.9 |
| Sum | 82.4 | 38.1 | 47.4 | 17.7 | 67.5 | 50.6 |
| Ratio | 81.4 | 36.6 | 45.0 | 24.8 | 65.2 | 50.6 |
In this subsection, we investigate different ways of computing the consistency score, in addition to taking the mean of PRM scores over steps. Here, Mean denotes averaging over steps as in Algorithm 1; Minimum and Sum denote taking the minimum and the sum over steps, respectively; and Ratio denotes filtering while preserving the original positive/negative sample distribution instead of balancing it. As shown in Table 6, Minimum (50.9), Sum (50.6), and Ratio (50.6) all underperform Mean (51.7) in average accuracy. This suggests that the mean provides a more stable estimate of reasoning consistency: unlike the minimum, it is less sensitive to a single poorly scored step, and unlike the sum, it avoids bias toward longer trajectories. Additionally, balancing the correct/incorrect ratio lets consistency-based filtering select the better-supported responses within each group without breaking class balance.
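The difference between the three step-level aggregations can be illustrated with a small sketch. The helper `aggregate` and the step scores below are hypothetical and serve only to show the length bias of the sum and the single-step sensitivity of the minimum; they are not taken from our training data.

```python
from statistics import mean


def aggregate(step_scores, how="mean"):
    """Collapse per-step PRM scores into a single response-level score."""
    if how == "mean":
        return mean(step_scores)
    if how == "min":
        return min(step_scores)
    if how == "sum":
        return sum(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical step scores, for illustration only.
short_strong = [0.9, 0.9]                 # two well-supported steps
long_weak = [0.6, 0.6, 0.6, 0.6, 0.6]     # five mediocre steps
one_bad_step = [0.9, 0.9, 0.1, 0.9]       # mostly strong, one low-scored step

# Sum favors the longer response; mean favors the better-supported one.
sum_prefers_long = aggregate(long_weak, "sum") > aggregate(short_strong, "sum")
mean_prefers_short = aggregate(short_strong, "mean") > aggregate(long_weak, "mean")

# Minimum is determined entirely by the single worst step.
min_score = aggregate(one_bad_step, "min")
mean_score = aggregate(one_bad_step, "mean")
```

In this toy case the sum ranks the five-step response above the two-step one even though each of its steps is scored lower, while the minimum reduces the third response to its single worst step.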
B.5 Effect of Step Number
To verify that PROF does not help merely by increasing the number of steps, we evaluate Filter-Nstep, which ranks and filters samples by shorter step counts instead of lower PRM–ORM consistency.
From Table 6, we find that Ratio scores only 50.6 on average and cannot compete with balanced filtering (PROF, 51.7), which further corroborates the importance of maintaining a balanced correct/incorrect proportion. Additionally, because PROF increases the number of intermediate reasoning steps, we compare against simple step-length filtering to verify that the gain does not come merely from longer responses. As shown in Figure 11 and Table 7, Filter-Nstep mainly manipulates step length, exhibits an unreasonable increase followed by a sudden drop, and yields inferior average accuracy (47.6 versus 51.7 for PROF-BOTH).
| Algorithm | Math500 | Minerva Math | Olympiad Bench | AIME24 | AMC23 | Average |
| PROF-BOTH | 83.1 | 39.0 | 47.8 | 17.5 | 70.9 | 51.7 |
| Filter-Nstep | 81.5 | 35.5 | 45.9 | 16.3 | 58.6 | 47.6 |
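The contrast between the two ranking criteria can be sketched as follows. This is a simplified illustration, not Algorithm 1: it ranks within a single group of responses, it assumes that Filter-Nstep drops the responses with the fewest steps, and the helper `keep_top`, the field names, and the scores are all hypothetical.

```python
from statistics import mean


def keep_top(responses, key, n_keep):
    """Keep the n_keep responses ranked highest by `key`; ties keep input order."""
    return sorted(responses, key=key, reverse=True)[:n_keep]


# Hypothetical responses from one group, each with per-step PRM scores.
responses = [
    {"id": "a", "step_scores": [0.9, 0.8]},
    {"id": "b", "step_scores": [0.4, 0.5, 0.3, 0.4]},
    {"id": "c", "step_scores": [0.7, 0.9, 0.8]},
]

# Consistency-style ranking: mean PRM score over steps.
by_consistency = keep_top(responses, lambda r: mean(r["step_scores"]), n_keep=2)

# Step-count ranking (assumed Filter-Nstep behavior): prefer more steps.
by_step_count = keep_top(responses, lambda r: len(r["step_scores"]), n_keep=2)

kept_by_consistency = [r["id"] for r in by_consistency]
kept_by_step_count = [r["id"] for r in by_step_count]
```

Here the step-count criterion keeps response `b`, the longest but lowest-scored one, whereas the mean-score criterion discards it. This is the kind of divergence the comparison above is meant to isolate.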
Appendix C Additional Examples