GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards
Abstract
Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervisi…