
Training a Positive / Negative Text Classifier

Before doing our subreddit analysis, we will train a model to classify whether a title is positive or negative. We will be using a dataset of 1,600,000 labelled Tweets (download).

Setup

The dependencies are listed in requirements.txt. They can be quickly installed with pip by running the following command

python -m pip install -r requirements.txt

Summary

We successfully preprocessed a corpus of tweets and used them to train a Naive Bayes classifier with ~71% accuracy. The model has been pickled to bin/classifer.o for future use.

Preprocessing Functions

A few steps need to be completed before we can train our model:

  1. Tokenization - All our tweets need to be tokenized before they can be processed. Tokenization isn't as simple as something like str.split('.'), though. For example, splitting "Mr.John likes iced coffee" on periods would improperly break "Mr.John" into two tokens.

  2. Lemmatization - This is the process of mapping a word to its root, and it is similar to stemming. Many word forms share the same meaning: for instance, "great", "greater", and "greatest" all have the same root.

  3. Normalization - Stopword removal + Lowercasing

    Uppercase and lowercase forms of a word carry the same meaning, so we lowercase everything. We will also remove stopwords: commonly used words, such as "a", "are", and "may", that are not significant parts of the sentence.

  4. Noise Removal

    Remove Twitter handles, hashtags, phone numbers, and special characters, which hold no textual value and may interfere with our training process. In our dataset, emojis have already been removed.
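The normalization and noise-removal steps above can be sketched with only the standard library's re module; the stopword list here is a tiny hypothetical stand-in for a full list such as NLTK's stopwords corpus.

```python
import re

# Hypothetical stopword list for illustration; a real run would use
# something like nltk.corpus.stopwords.words('english').
STOPWORDS = {"a", "are", "may", "the", "to", "of", "is"}

def clean_tweet(text):
    """Strip URLs, handles, hashtags, and special characters,
    then lowercase and drop stopwords."""
    text = re.sub(r"https?://\S+", "", text)   # URLs
    text = re.sub(r"[@#]\w+", "", text)        # handles and hashtags
    text = re.sub(r"[^a-zA-Z\s]", "", text)    # digits and special characters
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in STOPWORDS]

print(clean_tweet("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."))
# → ['awww', 'thats', 'bummer']
```

This covers noise removal and normalization; tokenization and lemmatization would additionally use a proper tokenizer and lemmatizer rather than the whitespace split shown here.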

Loading Data
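A minimal sketch of loading the data, assuming the CSV layout commonly used for this tweet dataset (no header row; columns of target, id, date, flag, user, text, with target 0 = negative and 4 = positive). A two-row in-memory sample stands in for the real 1,600,000-row file.

```python
import csv
import io

# Two fabricated rows in the assumed column order:
# target, id, date, flag, user, text
sample = io.StringIO(
    '"0","1","Mon Apr 06","NO_QUERY","user1","this is awful"\n'
    '"4","2","Mon Apr 06","NO_QUERY","user2","this is great"\n'
)

# Keep only the text and a readable label.
tweets = [(row[5], "positive" if row[0] == "4" else "negative")
          for row in csv.reader(sample)]
print(tweets)
# → [('this is awful', 'negative'), ('this is great', 'positive')]
```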

Preprocessing

Before Preprocessing

@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D

After Preprocessing

Preparing the dataset

We need to format the data as a labelled featureset, which is a requirement for training the model. We will do a standard 80/20 train-test split.
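A sketch of what this might look like: NLTK-style classifiers expect each example as a dict of feature names to values, paired with a label. The labelled examples below are hypothetical placeholders for the preprocessed tweets.

```python
import random

def to_featureset(tokens):
    # Map each token to True, the usual bag-of-words featureset shape.
    return {token: True for token in tokens}

# Hypothetical preprocessed, labelled data for illustration.
labelled = [(to_featureset(["great", "day"]), "positive"),
            (to_featureset(["awful", "day"]), "negative")] * 50

random.seed(0)
random.shuffle(labelled)

# Standard 80/20 train-test split.
split = int(0.8 * len(labelled))
train_set, test_set = labelled[:split], labelled[split:]
print(len(train_set), len(test_set))
# → 80 20
```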

Training

We train our model with the Naive Bayes classifier and use pickle to serialize it so it can be reused later. The classifier is 'naive' because it assumes that all our features are independent of each other given the label.
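To make the independence assumption concrete, here is a hand-rolled sketch (not the notebook's actual NLTK model): a label's score is its per-word probabilities multiplied together, as if each word occurred independently, plus a pickle round-trip of the model data. The word counts are fabricated for illustration.

```python
import math
import pickle

# Hypothetical per-label word counts standing in for a trained model.
word_counts = {
    "positive": {"great": 30, "love": 25, "bummer": 2},
    "negative": {"great": 3, "love": 4, "bummer": 40},
}

def classify(tokens):
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Naive independence: sum log P(word | label) per word,
        # with add-one smoothing to avoid zero probabilities.
        score = 0.0
        for t in tokens:
            score += math.log((counts.get(t, 0) + 1) / (total + len(counts)))
        scores[label] = score
    return max(scores, key=scores.get)

# Serialize the model data with pickle so it can be reloaded later.
blob = pickle.dumps(word_counts)
restored = pickle.loads(blob)

print(classify(["great", "love"]))
# → positive
```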

Results

4 means it is a positive sentiment.

0 means it is a negative sentiment.

Classification Report

Confusion Matrix
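A confusion matrix simply counts (true label, predicted label) pairs, which is easy to sketch by hand; the predictions below are fabricated for illustration, not the model's real output.

```python
from collections import Counter

# Hypothetical true labels and model predictions.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

# Confusion matrix as (true, predicted) -> count.
matrix = Counter(zip(y_true, y_pred))
for (t, p), n in sorted(matrix.items()):
    print(f"true={t} pred={p} count={n}")
```

Each diagonal entry (true == predicted) is a correct classification; off-diagonal entries show which direction the model's mistakes lean.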

Future Improvements

This is an okay model for now, but there are some improvements we can make in the future.