Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling
Abstract
Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules …