Detecting Online Abuse: Fine-Tuning LLMs for Abusive Language Detection
The proliferation of online abuse on social media platforms has emerged as a significant concern, negatively impacting users' mental health and online experiences. While the Natural Language Processing (NLP) community has developed various computational methods for abuse detection, including Support Vector Machines (SVMs) and Convolutional Neural Networks (CNNs), existing approaches predominantly focus on identifying explicit forms of abuse. This narrow focus overlooks subtle and contextual forms of online harassment, which can be equally damaging to users' wellbeing.
This thesis presents a novel approach to online abuse detection by integrating contextual embeddings with sentiment analysis features through the fine-tuning of Large Language Models (LLMs). Our methodology leverages a comprehensive dataset of 47,000 annotated tweets for training, combined with sentiment analysis capabilities developed using 50,000 IMDB movie reviews. The system employs the DistilBERT architecture to build a detection framework capable of identifying six distinct categories of abuse: ethnicity-based, age-based, gender-based, religion-based, other cyberbullying, and non-cyberbullying content. We established a rigorous evaluation framework employing multiple metrics, including accuracy, recall, and F1 score, to assess the model's performance in detecting both explicit and nuanced forms of online abuse.
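For illustration, the core fine-tuning step can be sketched with the Hugging Face Transformers library. The model name, dataset file names, column names, and hyperparameters below are assumptions for the sketch, not the exact configuration used in this work.

```python
# Minimal sketch: fine-tuning DistilBERT for six-way abuse classification.
# File names, column names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

LABELS = ["ethnicity", "age", "gender", "religion",
          "other_cyberbullying", "not_cyberbullying"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS))

# Hypothetical CSV files with columns "tweet_text" and "label" (integers 0-5).
dataset = load_dataset("csv", data_files={"train": "cyberbullying_train.csv",
                                          "test": "cyberbullying_test.csv"})

def tokenize(batch):
    # Truncate/pad tweets to a fixed length for batched training.
    return tokenizer(batch["tweet_text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)
dataset = dataset.rename_column("label", "labels")

args = TrainingArguments(output_dir="abuse-detector",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5,
                         evaluation_strategy="epoch")

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()
```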
The integrated system achieved an overall accuracy of 85% across the six categories on the cyberbullying dataset, outperforming other methodologies applied to the same data. In direct comparison, our approach, which uniquely combines contextual embeddings with sentiment analysis, demonstrated significant improvements over traditional fine-tuning methods, such as those using only BERT or RoBERTa, particularly in detecting subtle forms of abuse. Most notably, our system was more effective at identifying passive-aggressive content and context-dependent harassment, challenges that often cause conventional detection methods to fall short. This enhanced performance can be attributed to the model's ability to capture nuanced linguistic cues through its integrated analysis of both contextual information and sentiment, thereby offering a more refined interpretation of potentially harmful content.
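The per-category comparison can be reproduced with standard scikit-learn metrics; the label arrays below are placeholders standing in for the held-out gold labels and model predictions, not results from the thesis.

```python
# Illustrative evaluation over the six abuse categories with scikit-learn.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 3, 5, 1, 2, 4, 5, 0]   # placeholder gold labels (0-5)
y_pred = [0, 3, 5, 1, 4, 4, 5, 2]   # placeholder model predictions

print("accuracy:", accuracy_score(y_true, y_pred))
# Macro averaging weights each of the six classes equally, which surfaces
# performance on the subtler (and often rarer) abuse categories.
print("macro recall:", recall_score(y_true, y_pred, average="macro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```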
This research emphasizes the critical importance of incorporating subtle abuse detection into online content moderation systems. By developing more sophisticated detection methods that can identify both overt and nuanced forms of harassment, this work contributes to the creation of safer and more inclusive online spaces that facilitate constructive dialogue. The findings of this study have significant implications for the development of more effective content moderation tools and the broader goal of fostering healthier online communities.