Paper Deep Dive · Hugging Face · Published: 2026-06-08 · HF ↑7

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Authors: Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given promp…
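To make the group-level failure mode concrete, here is a minimal sketch. It assumes a GRPO-style estimator in which each trace's advantage is its reward minus the group mean, divided by the group standard deviation. The abstract is truncated and does not name the estimator, so the function name `group_advantages`, the `eps` parameter, and the binary rewards are all illustrative assumptions, not the authors' method.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Hypothetical helper for illustration only; the estimator is an
    assumption and is not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every trace gets the same verifiable reward yield
# all-zero advantages, so they carry no learning signal.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_fail)
print(all_pass)
```

Under this assumption, a prompt whose sampled traces all pass or all fail contributes nothing to the policy update, which is one way outcome-based rewards can become uninformative for a group.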

#llm #rl #coding #benchmark
