Pradipta, Benediktus (2025) Shaping Reasoning Through Rewards: Investigating Reward Structures in Post-Training LLMs with Pure Reinforcement Learning. Bachelor's Thesis, Artificial Intelligence.
Text: bAI2025BenediktusFP.pdf (1MB)
Text: Toestemming.pdf (242kB, restricted to registered users)
Abstract
Recent breakthroughs in Large Reasoning Models show that pure reinforcement learning can dramatically improve mathematical reasoning without supervised fine-tuning, yet the principles of effective reward design remain poorly understood. We systematically compared three reward structures of increasing complexity using GRPO on a 1.5B-parameter Qwen2.5 model trained on GSM8K. Counterintuitively, a minimal reward structure (accuracy + format only) achieved 48.4% pass@1 accuracy, significantly outperforming complex designs with length penalties, repetition constraints, and reasoning incentives (22.8% and 29.9%). Complex multi-component rewards created conflicting optimization gradients, causing training instability and reward hacking. In contrast, the simple objective enabled the spontaneous emergence of step-by-step reasoning in 64.9% of outputs without explicit incentives. These findings challenge conventional wisdom about reward engineering, suggesting that creating conditions for emergent intelligence through minimal constraints is more effective than explicitly encoding desired behaviors. This work provides evidence for a "less is more" principle in reward design for reasoning-capable AI systems.
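The "accuracy + format only" reward structure described in the abstract could be sketched as follows. This is an illustrative assumption, not the thesis's actual implementation: the `<answer>...</answer>` tag convention, the reward magnitudes, and the exact-match accuracy check are all hypothetical choices standing in for whatever the thesis used.

```python
import re

def minimal_reward(completion: str, ground_truth: str) -> float:
    """Hedged sketch of a minimal reward: accuracy + format only.

    Assumes the model is prompted to wrap its final answer in
    <answer>...</answer> tags; the specific tags, magnitudes, and
    matching rule are illustrative assumptions.
    """
    reward = 0.0
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match:
        reward += 0.5  # format component: well-formed answer tags present
        predicted = match.group(1).strip()
        if predicted == ground_truth.strip():
            reward += 1.0  # accuracy component: answer matches the reference
    return reward

# A correctly formatted, correct answer earns both components.
print(minimal_reward("Step 1: ... <answer>42</answer>", "42"))  # 1.5
```

In GRPO-style training loops (e.g. as in common RLHF libraries), a function of this shape would be applied to each sampled completion in a group, and the group-relative advantages would be computed from the resulting scalar rewards.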
| Item Type: | Thesis (Bachelor's Thesis) |
|---|---|
| Supervisor name: | Fernandes Cunha, R. |
| Degree programme: | Artificial Intelligence |
| Thesis type: | Bachelor's Thesis |
| Language: | English |
| Date Deposited: | 12 Aug 2025 08:28 |
| Last Modified: | 12 Aug 2025 08:28 |
| URI: | https://fse.studenttheses.ub.rug.nl/id/eprint/36723 |