Pradipta, Benediktus (2025) Shaping Reasoning Through Rewards: Investigating Reward Structures in Post-Training LLMs with Pure Reinforcement Learning. Bachelor's Thesis, Artificial Intelligence.
Text: bAI2025BenediktusFP.pdf (1MB)
Text: Toestemming.pdf (242kB, restricted to registered users)
Abstract
Recent breakthroughs in Large Reasoning Models show that pure reinforcement learning can dramatically improve mathematical reasoning without supervised fine-tuning, yet the principles of effective reward design remain poorly understood. We systematically compared three reward structures of increasing complexity using GRPO on a 1.5B-parameter Qwen2.5 model trained on GSM8K. Counterintuitively, a minimal reward structure (accuracy + format only) achieved 48.4% pass@1 accuracy, significantly outperforming complex designs with length penalties, repetition constraints, and reasoning incentives (22.8% and 29.9%). Complex multi-component rewards created conflicting optimization gradients, causing training instability and reward hacking. In contrast, the simple objective enabled the spontaneous emergence of step-by-step reasoning in 64.9% of outputs without explicit incentives. These findings challenge conventional wisdom about reward engineering, suggesting that creating conditions for emergent intelligence through minimal constraints is more effective than explicitly encoding desired behaviors. This work provides evidence for a "less is more" principle in reward design for reasoning-capable AI systems.
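The "accuracy + format only" reward structure described in the abstract could be sketched as follows. This is an illustrative assumption, not the thesis's actual implementation: the `<answer>...</answer>` tag convention, the reward magnitudes, and the exact-match accuracy check are all hypothetical choices standing in for whatever the thesis used.

```python
import re

def minimal_reward(completion: str, ground_truth: str) -> float:
    """Hedged sketch of a minimal reward: accuracy + format only.

    Assumes the model is prompted to wrap its final answer in
    <answer>...</answer> tags; the specific tags, magnitudes, and
    matching rule are illustrative assumptions.
    """
    reward = 0.0
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match:
        reward += 0.5  # format component: well-formed answer tags present
        predicted = match.group(1).strip()
        if predicted == ground_truth.strip():
            reward += 1.0  # accuracy component: answer matches the reference
    return reward

# A correctly formatted, correct answer earns both components.
print(minimal_reward("Step 1: ... <answer>42</answer>", "42"))  # 1.5
```

In GRPO-style training loops (e.g. as in common RLHF libraries), a function of this shape would be applied to each sampled completion in a group, and the group-relative advantages would be computed from the resulting scalar rewards.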
| Item Type: | Thesis (Bachelor's Thesis) |
|---|---|
| Supervisor name: | Fernandes Cunha, R. |
| Degree programme: | Artificial Intelligence |
| Thesis type: | Bachelor's Thesis |
| Language: | English |
| Date Deposited: | 12 Aug 2025 08:28 |
| Last Modified: | 12 Aug 2025 08:28 |
| URI: | https://fse.studenttheses.ub.rug.nl/id/eprint/36723 |