Identify discussions on a topic: An analysis of CodeProject messages regarding workplace fairness issues related to software developers

Grigorescu, Tudor (2024) Identify discussions on a topic: An analysis of CodeProject messages regarding workplace fairness issues related to software developers. Bachelor's Thesis, Computing Science.

Abstract

Workplace fairness has become a significant topic of discussion, particularly in fast-paced and high-pressure fields like software development, where perceived or real inequities are prevalent. Concurrently, the rapid advancement of Large Language Models (LLMs) has generated substantial interest in their applications across various domains, especially in Information Retrieval (IR) and Natural Language Processing (NLP). Combining these advancements to analyze fairness issues in workplace environments raises a critical question: how can LLMs be used effectively for this purpose? To address this, we studied whether a BERTopic model could produce, from an assembled dataset of CodeProject messages, a topic distribution in which the majority of topics relate directly or indirectly to fairness issues, and what insights could be gained from it. We created three implementations of this model that produced sufficiently distinct distributions, and analyzed them against a set of definitions describing various fairness problems in software engineering to determine the actual accuracy and precision of the models versus our expectations. The first version applied minimal pre-processing to the dataset to see what the overall topic distribution would look like. In the second, we removed common stopwords and used a list of seed words based on the dimensions of fairness to encourage certain topics to stand out. The last version used more extensive pre-processing alongside a curated dictionary of synonyms based on the previous list to filter the dataset before applying the seed word method, in an attempt to further remove any post not related to fairness issues. The analysis reveals a loss of accuracy for every attempt to increase the precision in our BERTopic implementations. With minimal pre-processing, the first model provided the highest accuracy, suggesting that a more straightforward approach was more effective in this context. In contrast, the second and third implementations, which involved more complex pre-processing and the use of seed words, resulted in a significant drop in accuracy. The conclusion is that while the precision in targeting specific fairness-related keywords increased, the broader context of fairness issues was overlooked, leading to decreased accuracy.
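
The filtering step described for the third version (stopword removal, then keeping only messages that match a seed word or one of its synonyms) might look roughly like the sketch below. This is an illustration under stated assumptions, not the thesis's implementation: the stopword set, seed words, synonym dictionary, sample messages, and all function names are hypothetical, since the abstract does not list them.

```python
import re

# Hypothetical values for illustration; the abstract does not give the real lists.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in",
             "my", "for", "than", "same", "how", "do", "i"}
SEED_WORDS = ["fair", "pay", "bias"]
SYNONYMS = {
    "fair": {"fairness", "unfair", "equitable"},
    "pay": {"salary", "wage", "compensation"},
    "bias": {"biased", "discrimination", "favoritism"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(text):
    """Return the tokens of `text` with common stopwords removed."""
    return [t for t in tokenize(text) if t not in STOPWORDS]

def fairness_vocabulary():
    """Union of the seed words and all of their listed synonyms."""
    vocab = set(SEED_WORDS)
    for syns in SYNONYMS.values():
        vocab |= syns
    return vocab

def filter_messages(messages):
    """Keep only messages containing at least one fairness-related term."""
    vocab = fairness_vocabulary()
    return [m for m in messages if vocab & set(remove_stopwords(m))]

messages = [
    "My salary is lower than a colleague doing the same job",
    "How do I center a div in CSS",
    "The promotion process felt biased",
]
kept = filter_messages(messages)
# The filtered messages would then go to a topic model; BERTopic accepts
# seed words for guided topic modeling via its `seed_topic_list` argument.
```

A keyword filter of this kind keeps only messages that use the listed terms, which is consistent with the abstract's observation that targeting specific keywords can miss the broader context of fairness discussions.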

Item Type:	Thesis (Bachelor's Thesis)
Supervisor name:	Rastogi, A. and Andrikopoulos, V.
Degree programme:	Computing Science
Thesis type:	Bachelor's Thesis
Language:	English
Date Deposited:	23 Jul 2024 13:30
Last Modified:	31 Jul 2024 07:26
URI:	https://fse.studenttheses.ub.rug.nl/id/eprint/33607
