Steringa, Quinten (2025) No Supervision, No Problem: Pure Reinforcement Learning Improves Mathematical Reasoning in Small Language Models. Bachelor's Thesis, Artificial Intelligence.
Full text: bAI2025SteringaQE.pdf (5 MB); Toestemming.pdf (242 kB, restricted to registered users)
Abstract
This study explores whether pure reinforcement learning (RL), without supervised fine-tuning (SFT), can improve the mathematical reasoning ability of small language models. Using Group Relative Policy Optimization (GRPO), four pre-trained Qwen model variants were post-trained with RL alone on a subset of the GSM8K dataset. Models specialized in mathematical reasoning, such as Qwen2.5-Math-1.5B and Qwen2-Math-1.5B, achieved significant improvements in pass@1 accuracy over their baselines. General-purpose models showed modest improvements, while a smaller 0.5B model suffered a performance drop, revealing capacity limitations when optimizing multiple objectives. Notably, a direct comparison showed that pure RL outperformed the conventional SFT-then-RL approach in both accuracy and training efficiency under a fixed maximum output token limit. The experimental results demonstrate that pure RL can effectively improve reasoning ability when sufficient domain specialization and model capacity are present, potentially eliminating the need for costly SFT in resource-limited settings.
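The abstract's core mechanism, GRPO, replaces a learned value critic with advantages computed relative to a group of sampled completions for the same prompt. A minimal sketch of that group-relative advantage step is below; the function name and the binary correctness rewards are illustrative assumptions, not the thesis's actual training code.

```python
# Illustrative sketch of GRPO's group-relative advantage computation.
# Assumption: rewards are per-completion scalars (e.g. 1.0 if the final
# GSM8K answer is correct, 0.0 otherwise); this is not the thesis's code.
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize rewards within one group of G completions sampled
    for the same prompt: A_i = (r_i - mean(r)) / std(r)."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    if sigma == 0:
        # All completions scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: 4 sampled answers to one problem, two of them correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are centered within each group, correct completions are pushed up only relative to their incorrect siblings, which is what removes the need for a separate value network.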
| Item Type: | Thesis (Bachelor's Thesis) |
|---|---|
| Supervisor name: | Fernandes Cunha, R. |
| Degree programme: | Artificial Intelligence |
| Thesis type: | Bachelor's Thesis |
| Language: | English |
| Date Deposited: | 23 Jul 2025 08:39 |
| Last Modified: | 23 Jul 2025 08:39 |
| URI: | https://fse.studenttheses.ub.rug.nl/id/eprint/36470 |