Scholz, Jonas (2025) Exploring the Limitations of LLMs Logical Reasoning Capabilities Using Knights and Knaves Puzzles. Master's Thesis / Essay, Artificial Intelligence.
Abstract
Both LLMs and reasoning models appear to exhibit logical reasoning capabilities on some benchmarks. These benchmarks have two key problems that make testing reasoning capabilities challenging: data contamination and a lack of complexity scaling. To address these limitations, this paper introduces a dynamic benchmark based on the Knights and Knaves puzzle. It generates puzzles with a variable number of people in the puzzle and per statement, which controls the complexity of the puzzle. To combat data contamination, variance was added through exchangeable names, type labels and places. The benchmark was used to evaluate the reasoning capabilities of LLMs by measuring their accuracy on the puzzles. Gemini 1.5 Pro performed the worst, with accuracy halving from the least to the most complex puzzle. In contrast, the reasoning models performed better and showed only a slight decrease in accuracy as the number of people in the puzzle increased. This suggests that state-of-the-art models perform well on logical reasoning tasks, although performance decreases as puzzle complexity increases. Further tests are needed to determine whether that pattern holds as complexity rises further, and to determine whether the models are actually reasoning by examining their reasoning steps. Overall, the introduced benchmark is capable of evaluating LLMs, and because its complexity is scalable it should remain relevant.
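To make the idea of a scalable puzzle generator concrete, the following is a minimal sketch and not the thesis's implementation. It assumes a simplified puzzle format in which each person makes one statement that is a conjunction of claims of the form "X is a knight" or "X is a knave". The function names (`solve`, `generate`, `render`), the example names and the `max_attempts` parameter are all hypothetical. Ground truth is obtained by brute force over every knight/knave assignment, and the surface wording is kept separate from the logic so that names and type labels can be exchanged.

```python
import itertools
import random


def statement_holds(claims, assignment):
    """A statement is a conjunction of (person_index, claimed_is_knight) pairs."""
    return all(assignment[i] == is_knight for i, is_knight in claims)


def solve(statements):
    """Return all assignments (True = knight) consistent with the statements.

    Person s makes statements[s]. A knight's statement must be true and a
    knave's must be false, so its truth value must equal assignment[s].
    """
    n = len(statements)
    return [
        a for a in itertools.product([True, False], repeat=n)
        if all(statement_holds(statements[s], a) == a[s] for s in range(n))
    ]


def generate(n_people, per_statement, rng, max_attempts=10_000):
    """Sample random puzzles until one has exactly one solution.

    In this simplified format a puzzle with one claim per statement never has
    a unique solution (flipping every role preserves consistency), so
    per_statement should be at least 2.
    """
    for _ in range(max_attempts):
        statements = []
        for _ in range(n_people):
            targets = rng.sample(range(n_people), per_statement)
            statements.append([(t, rng.choice([True, False])) for t in targets])
        solutions = solve(statements)
        if len(solutions) == 1:
            return statements, solutions[0]
    raise RuntimeError("no uniquely solvable puzzle found")


def render(statements, names, labels=("knight", "knave")):
    """Turn the logical form into text with exchangeable names and labels."""
    lines = []
    for speaker, claims in enumerate(statements):
        parts = [
            f"{names[t]} is a {labels[0] if is_knight else labels[1]}"
            for t, is_knight in claims
        ]
        lines.append(f"{names[speaker]} says: " + " and ".join(parts) + ".")
    return "\n".join(lines)


# Hand-written two-person puzzle: person 0 is the knight, person 1 the knave.
example = [[(1, False)], [(0, True), (1, True)]]
print(render(example, ["Ava", "Ben"]))
print(solve(example))  # → [(True, False)]

# Randomly generated puzzle with 3 people and 2 claims per statement.
puzzle, answer = generate(3, 2, random.Random(0))
print(render(puzzle, ["Ava", "Ben", "Cleo"], labels=("truth-teller", "liar")))
```

Under these assumptions, raising `n_people` or `per_statement` makes the puzzles harder, while passing different `names` and `labels` changes only the wording and leaves the solution untouched.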
| Item Type: | Thesis (Master's Thesis / Essay) |
|---|---|
| Supervisor name: | Verheij, H.B. and Steging, C.C. |
| Degree programme: | Artificial Intelligence |
| Thesis type: | Master's Thesis / Essay |
| Language: | English |
| Date Deposited: | 12 Aug 2025 08:23 |
| Last Modified: | 12 Aug 2025 08:23 |
| URI: | https://fse.studenttheses.ub.rug.nl/id/eprint/36693 |