Cyberbullying Detection on Social Media Using Text Mining

Introduction

Cyberbullying on social media is a pressing issue that affects individuals across various demographics. The aim of this project is to develop a system that detects instances of cyberbullying in real time using text mining techniques. This proposal outlines an approach to building such a system with machine learning and natural language processing (NLP) methods.

Background

Recent research has highlighted the potential of text mining and machine learning in identifying cyberbullying content on social media platforms. Techniques such as sentiment analysis and feature extraction have been employed to analyze textual data for harmful content. The integration of pre-trained models like BERT has shown promise in improving detection accuracy by understanding contextual nuances in language.

Project Objective

The primary objective of this project is to create a robust system capable of detecting cyberbullying across multiple social media platforms. This system will utilize advanced text mining techniques to identify and classify harmful content, thereby enabling timely interventions.

Methodology

1. Data Collection and Preprocessing

Datasets: Use publicly available datasets such as those from Twitter and Wikipedia, which contain labeled instances of hate speech and personal attacks.
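Text Preprocessing: Apply tokenization, stop-word removal, and stemming to prepare the data for the feature-based models. BERT uses its own subword tokenizer, so its input is normally only lightly cleaned rather than stemmed or stripped of stop words.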
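
The preprocessing steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the stop-word list, the suffix list, and the function names (tokenize, naive_stem, preprocess) are all assumptions made for the example, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "and", "of", "so"}

# Suffixes removed by the naive stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
```

For example, preprocess("You are SO annoying and stupid!") yields the token list ["annoy", "stupid"], which can then be passed to feature extraction.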

2. Model Architecture

Machine Learning Models: Employ models like Support Vector Machines (SVM), Random Forest, and pre-trained BERT for text classification.
Feature Extraction: Utilize techniques such as TF-IDF and word embeddings to extract meaningful features from the text data.
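
To make the TF-IDF step concrete, here is a small standard-library sketch. It assumes documents are already tokenized, uses term frequency normalized by document length, and uses a smoothed inverse document frequency, ln((1 + N) / (1 + df)) + 1, which is one common variant. The function name tf_idf is hypothetical; in practice a library vectorizer would normally be used, and its output would feed a classifier such as an SVM or Random Forest.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = ln((1 + N) / (1 + df)) + 1   (smoothed variant, an assumption here)
    """
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors
```

Terms that occur in every document (such as "you") receive a lower weight than terms that distinguish one document from another (such as an insult that appears in only a few posts), which is the property the classifiers rely on.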

3. Training and Evaluation

Training: Train the models on the labeled datasets, with a focus on minimizing both false positives and false negatives.
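Evaluation Metrics: Assess model performance using precision, recall, F1-score, and accuracy. Because abusive posts are typically a minority of the data, accuracy alone can be misleading, so precision, recall, and F1-score on the cyberbullying class should be reported alongside it.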
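
The metrics listed above can be computed directly from true and predicted labels. The sketch below assumes a binary task in which label 1 marks cyberbullying; the function name classification_metrics and the label encoding are illustrative choices, not part of the proposal.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1-score, and accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

With y_true = [1, 1, 1, 0, 0, 0, 0, 0] and y_pred = [1, 1, 0, 1, 0, 0, 0, 0], there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so precision, recall, and F1-score are each 2/3 while accuracy is 0.75.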

Expected Outcomes

The proposed system is expected to detect instances of cyberbullying with high precision and recall. By combining machine learning models with TF-IDF and word-embedding features, the system should identify harmful content across diverse linguistic contexts.

Conclusion

This project aims to contribute to safer online environments by developing a cyberbullying detection system. Integrating machine learning models with text mining techniques is expected to improve detection capability.

For further details on related research, please refer to the paper "Cyber-Bullying Detection Via Text Mining and Machine Learning," available at https://ieeexplore.ieee.org/document/9579625.

Dataset link: Twitter Hate Speech Dataset