GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

https://arxiv.org/pdf/2601.05242


🧠 Mastering Multi-Objective Reinforcement Learning!

In this post, we explore the innovative framework of GDPO (Group Reward-Decoupled Normalization Policy Optimization) that teaches AI to achieve multiple goals simultaneously—like walking and chewing gum at the same time. We break down the complexities of aligning a model to satisfy diverse objectives while maintaining effectiveness in performance.

Discover how traditional methods collapse under conflicting demands, and learn how GDPO's novel approach of decoupling rewards restores the clarity needed for effective training signals. This post will equip you with insights into the pitfalls of multi-objective RL and the solutions that can lead to better AI performance.

📌 What You'll Learn:
• 🎯 The concept of multi-objective reinforcement learning and its challenges
• 🥤 Why mixing reward signals can lead to ineffective learning
• 🔑 How to decouple rewards for better signal clarity
• 🧩 Practical solutions to the "Lazy Lawyer" problem in reward conditioning
• ⚖️ The five-step checklist for fixing multi-objective RL

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
https://arxiv.org/pdf/2601.05242
Shih-Yang Liu1, Xin Dong*, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng1, Yejin Choi, Jan Kautz, Pavlo Molchanov
NVIDIA

Let's Dive In...

1. Introduction: The Evolution of Multi-Objective Alignment

The paradigm of Large Language Model (LLM) training has shifted from simple next-token prediction and single-objective accuracy toward complex alignment with multi-faceted human preferences. Modern production models must simultaneously satisfy diverse, often competing constraints, including logical correctness, safety, stylistic formatting, and conciseness. As these requirements grow in complexity, Reinforcement Learning (RL) pipelines must process heterogeneous reward signals without losing resolution in the learning objective.

This document introduces Group reward-Decoupled Normalization Policy Optimization (GDPO), a framework designed to overcome the critical limitations of standard Group Relative Policy Optimization (GRPO) in multi-reward environments. While GRPO has gained traction due to its efficiency in eliminating the need for a separate value model, its "sum-then-normalize" approach introduces a significant technological gap: Reward Signal Collapse. GDPO serves as a robust solution to this resolution loss, ensuring stable and precise alignment for high-dimensional reward spaces.

Figure 1 | (a): An overview of GDPO, which performs group-wise normalization per reward and then applies batch-wise advantage normalization to preserve a stable numerical range independent of reward count and improve update stability. (b): Median and IQR reward curves over five runs of Qwen2.5-Instruct-1.5B tool-calling RL, demonstrating that GDPO consistently converges to higher correctness and format reward scores than GRPO.

2. Theoretical Analysis: The Phenomenon of Reward Signal Collapse

In standard multi-objective RL implementations using GRPO, individual reward components (r_1 \dots r_n) are typically summed into a scalar r_{sum} before undergoing group-wise normalization. Analytically, this approach fails because it prematurely compresses the reward space, leading to Reward Signal Collapse. This collapse manifests as a critical loss of gradient signal resolution; when different rollout profiles result in identical advantage values, the policy gradient effectively "averages out" the learning signal, resulting in gradient signal sparsity and suboptimal convergence.

Distinct Advantage Groups and Scaling Gaps

The impact of this compression is best illustrated by the number of unique advantage groups the algorithm can distinguish.

Case Study: Binary Reward Resolution

Consider a scenario with two binary rewards (r_1, r_2 \in \{0, 1\}) and two rollouts (G=2) used to calculate group-relative advantages. Each rollout's summed reward is 0, 1, or 2, so a group of two rollouts yields six distinct unordered pairs of summed rewards. GRPO maps them into only two unique advantage groups:

  • Pairs (0, 1), (0, 2), and (1, 2) all result in identical normalized advantages of (-0.7071, 0.7071).
  • Pairs (0, 0), (1, 1), and (2, 2) all result in (0, 0).

In this setup, GRPO is functionally "blind" to the difference between a rollout that satisfies two rewards versus one that satisfies only one, provided the relative difference within the group is identical. GDPO, by decoupling normalization, preserves three distinct groups in this same scenario.
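The case study can be reproduced with a short enumeration. The sketch below is illustrative and not the authors' code: the function names are hypothetical, it assumes the sample standard deviation (which is what yields the 0.7071 values quoted above), it assigns zero advantage when a group has zero variance, and it treats two advantage vectors as the same group when they match up to rollout order. Batch-wise normalization is left out because it does not change the count.

```python
from itertools import product
from statistics import mean, stdev


def group_normalize(values):
    """(x - mean) / sample std within one group; all zeros if the group is constant."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:  # assumption: a constant group carries no learning signal
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def grpo_advantages(rollouts):
    """Sum-then-normalize: collapse each rollout's rewards to a scalar first."""
    return group_normalize([sum(r) for r in rollouts])


def gdpo_advantages(rollouts):
    """Normalize-then-sum: normalize each reward separately, then add."""
    n_rewards = len(rollouts[0])
    per_reward = [group_normalize([r[k] for r in rollouts]) for k in range(n_rewards)]
    return [sum(col[j] for col in per_reward) for j in range(len(rollouts))]


def distinct_groups(adv_fn, n_rewards=2, n_rollouts=2):
    """Enumerate every binary reward assignment and collect distinct advantage vectors."""
    seen = set()
    one_rollout = list(product([0, 1], repeat=n_rewards))
    for rollouts in product(one_rollout, repeat=n_rollouts):
        adv = adv_fn(rollouts)
        # Sort so rollout order does not matter; "+ 0.0" turns -0.0 into 0.0.
        seen.add(tuple(sorted(round(a, 4) + 0.0 for a in adv)))
    return seen


grpo_groups = distinct_groups(grpo_advantages)
gdpo_groups = distinct_groups(gdpo_advantages)
print(len(grpo_groups), sorted(grpo_groups))
print(len(gdpo_groups), sorted(gdpo_groups))
```

Under these assumptions the extra GDPO group is (-1.4142, 1.4142): it appears when one rollout earns both rewards and the other earns neither, a case GRPO cannot tell apart from a one-reward gap.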

Theoretical analysis shows that this gap isn't constant; it widens as the number of rewards (n) or rollouts (G) increases. For example, at 16 rollouts, GDPO maintains 180+ distinct advantage groups compared to GRPO’s ~120. When rewards scale to n=8, GDPO maintains over 700 distinct groups while GRPO remains plateaued at approximately 120.

The Insufficiency of Standard Deviation Removal

Some variants (e.g., GRPO w/o std) attempt to mitigate collapse by removing the standard deviation term. While this slightly increases the advantage group count, empirical evidence—particularly in tool-calling tasks—demonstrates that this modification is insufficient. It fails to restore the necessary training resolution and frequently introduces training instability, often leading to a total failure to converge on structured formatting rewards.
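To make the "slight increase" concrete, here is a minimal sketch of the same two-binary-reward, two-rollout enumeration for the no-std variant. The function names are hypothetical, and the variant is assumed to simply center the summed rewards within the group without dividing by the standard deviation.

```python
from itertools import product
from statistics import mean


def grpo_no_std_advantages(rollouts):
    """Center the summed rewards within the group, without dividing by std."""
    totals = [sum(r) for r in rollouts]
    mu = mean(totals)
    return [t - mu for t in totals]


def count_groups(n_rewards=2, n_rollouts=2):
    """Collect distinct advantage vectors over all binary reward assignments."""
    seen = set()
    one_rollout = list(product([0, 1], repeat=n_rewards))
    for rollouts in product(one_rollout, repeat=n_rollouts):
        adv = grpo_no_std_advantages(rollouts)
        # Sort so rollout order does not matter; "+ 0.0" turns -0.0 into 0.0.
        seen.add(tuple(sorted(round(a, 4) + 0.0 for a in adv)))
    return seen


no_std_groups = count_groups()
print(len(no_std_groups), sorted(no_std_groups))
```

In this smallest setting the count rises from two to three groups. The count alone does not capture the instability described above, which is an empirical observation about training runs.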

Figure 2 | Comparison of GRPO and GDPO advantage computation in a two-binary-reward, two-rollout example. GRPO maps different reward combinations into only two distinct advantage groups, whereas GDPO normalizes each reward independently and retains three distinct groups of advantage values. We skip the batch-wise normalization calculation step in GDPO here for simplicity since it does not change the number of distinct advantage groups.

3. The GDPO Framework: Decoupled Normalization Architecture

GDPO shifts the RL optimization strategy from a monolithic sum to a modular, decoupled architecture. By processing each reward component independently, it preserves the integrity of individual signals before they are aggregated into a final policy update.

Mathematical Formulation

The GDPO framework utilizes a three-tiered advantage estimation process. We define the following notation: n as the number of objectives, G as the number of rollouts, B as batch size, and D_{Batch} = \{question \, i \mid i = 1, \dots, B\} as the set of questions in a batch.

  1. Decoupled Group-wise Normalization: Each reward component k is normalized independently across the group: A^{(i,j)}_k = \frac{r^{(i,j)}_k - \text{mean}\{r^{(i,1)}_k, \ldots, r^{(i,G)}_k\}}{\text{std}\{r^{(i,1)}_k, \ldots, r^{(i,G)}_k\}}
  2. Aggregation: The independently normalized advantages are summed: A^{(i,j)}_{sum} = A^{(i,j)}_1 + \cdots + A^{(i,j)}_n
  3. Batch-wise Advantage Normalization (BN): A final normalization is applied across the entire training batch to maintain a stable numerical range: \hat{A}^{(i,j)}_{sum} = \frac{A^{(i,j)}_{sum} - \text{mean}\{A^{(i',j')}_{sum} \mid i' \in D_{Batch}, j' = 1, \ldots, G\}}{\text{std}\{A^{(i',j')}_{sum} \mid i' \in D_{Batch}, j' = 1, \ldots, G\} + \epsilon}
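The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the function names and the nested-list layout `rewards[i][j][k]` are hypothetical, the sample standard deviation is used throughout, and a reward that is constant within a group is assigned zero advantage (the step 1 formula has no epsilon, so that case is otherwise undefined).

```python
from statistics import mean, stdev


def group_normalize(values):
    """Step 1 helper: (x - mean) / std within one group of rollouts."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:  # assumption: constant group -> zero advantage
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def gdpo_batch_advantages(rewards, eps=1e-6):
    """rewards[i][j][k]: reward k of rollout j for question i. Returns advantages[i][j]."""
    summed = []
    for question in rewards:
        n_rewards = len(question[0])
        # Step 1: normalize each reward across the G rollouts of this question.
        per_reward = [group_normalize([rollout[k] for rollout in question])
                      for k in range(n_rewards)]
        # Step 2: sum the normalized advantages per rollout.
        summed.append([sum(col[j] for col in per_reward) for j in range(len(question))])
    # Step 3: normalize across every rollout of every question in the batch.
    flat = [a for row in summed for a in row]
    mu, sd = mean(flat), stdev(flat)
    return [[(a - mu) / (sd + eps) for a in row] for row in summed]


batch = [
    [[1, 1], [0, 1], [0, 0], [1, 0]],  # question 1: G=4 rollouts, n=2 rewards
    [[1, 0], [1, 0], [0, 0], [1, 1]],  # question 2
]
advantages = gdpo_batch_advantages(batch)
for row in advantages:
    print([round(a, 3) for a in row])
```

In a real pipeline these would be tensor operations over the batch; the nested loops are only meant to mirror the indices i, j, and k in the formulas.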

The Rationale for Batch-wise Normalization: Without BN, the magnitude of the aggregated advantage grows linearly with the number of rewards n. This unnormalized growth can lead to gradient explosion, destabilizing the policy update. BN ensures that the training signal remains within a stable range regardless of the number of rewards, preventing the catastrophic failures observed in "GDPO w/o BN" configurations.
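A small sketch of the growth that BN is meant to counter. It looks at one extreme case chosen for illustration: two rollouts where one earns every reward and the other earns none, so all n normalized advantages point the same way. The function name is hypothetical and the sample standard deviation is assumed.

```python
from statistics import mean, stdev


def summed_advantage_magnitude(n_rewards):
    """|A_sum| for G=2 when one rollout earns all n rewards and the other earns none."""
    rollouts = [[1] * n_rewards, [0] * n_rewards]
    total = [0.0, 0.0]
    for k in range(n_rewards):
        column = [r[k] for r in rollouts]
        mu, sd = mean(column), stdev(column)
        for j, value in enumerate(column):
            total[j] += (value - mu) / sd  # decoupled normalization, then sum
    return max(abs(a) for a in total)


for n in (1, 2, 4, 8):
    print(n, round(summed_advantage_magnitude(n), 4))
```

In this case the magnitude before BN is n times the single-reward value, which is the linear growth described above.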

Figure 3 | Comparison of the number of distinct advantage groups produced by GRPO, GRPO without standard deviation normalization (GRPO w/o std), and GDPO. As the number of rollouts (left) or rewards (right) grows, GDPO consistently preserves a substantially larger number of distinct advantage groups compared to GRPO and GRPO w/o std. This results in advantage estimations that provide more expressive training signals.

4. Strategic Priority Management: Weighting vs. Conditioning

Effective alignment requires managing the relative priority of objectives. While Reward Weighting (assigning coefficients to normalized advantages) is a common lever, it frequently fails when rewards vary in difficulty. Models often "hack" easier objectives (like response length) while ignoring high-priority, difficult objectives (like logical correctness).
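Written out, reward weighting in this setting amounts to a weighted sum of the per-reward normalized advantages from Section 3. The weights w_k are notation introduced here for illustration:

```latex
A^{(i,j)}_{sum} = w_1 A^{(i,j)}_1 + \cdots + w_n A^{(i,j)}_n
```

Because each A_k is already normalized within its group, lowering w_k shrinks that reward's contribution but does not stop the model from saturating an easy reward first, which is the failure described above.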

To enforce strict prioritization, we utilize Conditioned Rewards. For example, a length reward ℛ̃_{length} is only granted if the primary correctness threshold is met:

ℛ̃_{length} = \begin{cases} 1 & \text{if length } \le l \text{ AND } ℛ_{correct} = 1 \\ 0 & \text{otherwise} \end{cases}

This logic forces the model to allocate "intelligence per token," addressing the most challenging objectives first.
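The conditioned length reward translates directly into a few lines of code. The sketch below is illustrative: the function and parameter names are hypothetical, and correctness is assumed to be a binary reward in {0, 1}.

```python
def conditioned_length_reward(response_length, max_length, correctness_reward):
    """Return 1 only when the response is within the length limit AND correct."""
    within_limit = response_length <= max_length
    return 1 if within_limit and correctness_reward == 1 else 0


# A short but wrong answer earns nothing, so brevity alone cannot be "hacked".
print(conditioned_length_reward(120, 512, correctness_reward=0))
# A short and correct answer earns the length reward.
print(conditioned_length_reward(120, 512, correctness_reward=1))
# A correct but over-long answer does not.
print(conditioned_length_reward(900, 512, correctness_reward=1))
```

The design choice is that the easy objective only pays out once the hard one is met, so the model has no incentive to trade correctness for brevity.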

Best Practices for Priority Tuning:

  1. Identify Difficulty Gaps: Recognize when one reward is significantly easier for the model to "game."
  2. Apply Conditioning: Use conditioned functions to prevent easy-reward hacking.
  3. Fine-tune with Weights: Once primary objectives are stabilized, use weights for secondary behavioral adjustments.
  4. Monitor Intelligence-Efficiency Trade-offs: Ensure length constraints do not suppress reasoning depth.

Figure 4 | Median and IQR reward curves across five runs of Qwen2.5-1.5B on the tool-calling task for GDPO, GRPO, and GRPO w/o std. GDPO consistently converges to higher correctness and format rewards, while GRPO w/o std matches correctness gains but fails to converge on the format reward.

5. Empirical Performance and Benchmarking Results

GDPO’s efficacy was validated across Tool Calling, Mathematical Reasoning, and Coding tasks using diverse architectures.

Tool Calling Results

GDPO consistently outperformed GRPO in balancing functional accuracy with strict structure.

| Model | Method | Avg Acc ↑ | Correct Format ↑ |
|---|---|---|---|
| Qwen2.5-Instruct-1.5B | GRPO | 30.18% | 76.33% |
| Qwen2.5-Instruct-1.5B | GDPO | 32.81% | 80.66% |
| Qwen2.5-Instruct-3B | GRPO | 39.20% | 81.64% |
| Qwen2.5-Instruct-3B | GDPO | 40.87% | 82.23% |

Notably, GRPO w/o std failed entirely on the format reward (a 0% correct-format ratio), underscoring the stability of the GDPO architecture.

Mathematical Reasoning: The Recovery Phenomenon

On benchmarks (AIME, MATH, AMC), GDPO demonstrated a unique "recovery" phenomenon. In early training, both GDPO and GRPO prioritize the easier length reward, causing an initial drop in correctness. However, GDPO effectively recovers the correctness signal, while GRPO’s correctness eventually declines as it fails to balance the competing objectives.

For DeepSeek-R1-1.5B, GDPO achieved significant accuracy gains of 2.6% on MATH, 6.7% on AIME, and 2.3% on Olympiad, alongside an 80% reduction in length-exceeding responses.

Coding Reasoning

In a three-objective setting (Passrate, Length, Bug Reduction), GDPO maintained high pass rates while simultaneously reducing bug ratios and length violations. This confirms that GDPO scales effectively to higher-dimensional reward spaces where standard GRPO struggles to maintain a favorable trade-off across all dimensions.

Figure 5 | Training behavior of GRPO and GDPO on DeepSeek-R1-1.5B across correctness reward, length reward, and maximum batch response length. Both methods rapidly maximize the length reward, briefly suppressing correctness, yet GDPO subsequently recovers it and surpasses GRPO. After roughly 400 steps, GRPO’s correctness score declines and its length-constraint violations increase, as reflected by rising maximum response lengths. In contrast, GDPO continues to improve correctness while steadily improving the control over response length.

6. Implementation, Stability, and Conclusion

GDPO is designed for immediate integration into production RLHF pipelines and is supported in frameworks like HF-TRL, verl, and Nemo-RL. Practical implementation relies heavily on Batch-wise Normalization (BN); experimental evidence confirms that runs without BN occasionally suffer from catastrophic divergence due to unscaled advantage magnitudes.

In summary, GDPO remediates reward signal collapse by decoupling normalization and preserving advantage granularity. By providing a richer training signal, it enables more stable convergence and better alignment across the multi-objective tasks evaluated here. As LLMs are increasingly tasked with balancing accuracy, safety, and efficiency, GDPO offers a strong foundation for aligning models with the nuanced preferences of human users in real-world deployment.

Figure 6 | Average accuracy and exceed-length ratios for GRPO/GDPO-trained DeepSeek-R1-7B models under varying length reward weights {1.0, 0.75, 0.5, 0.25}, with and without the conditioned length reward ℛ̃_{length}, on mathematical reasoning tasks.

Figure 7 | Training curves of GRPO and GDPO with the conditioned length reward ℛ̃_{length} on DeepSeek-R1-7B across correctness reward and length reward.

Fin...