Difficulty: Beginner
Estimated time: 4 hours

Twitter Sentiment Analysis Classifier

Build a machine learning model to classify tweets as positive, negative, or neutral using Python and scikit-learn.

Coding Project

This is a hands-on coding project: you'll write and run all of the code yourself, ideally in Google Colab.

Tools & Technologies

Python, scikit-learn, Pandas, Jupyter

Prerequisites

Before starting, make sure you have:

  • Python (pre-installed in Google Colab, or installed locally)
  • scikit-learn
  • Pandas
  • Jupyter Notebook (not needed if you work in Colab)

Project Details

Project Overview

Sentiment analysis is one of the most popular applications of Natural Language Processing (NLP). In this beginner-friendly project, you'll build a machine learning classifier that automatically determines the sentiment a tweet expresses. The Sentiment140 dataset used in this guide labels tweets as positive or negative; extending the classifier to a neutral class is suggested under Next Steps.

Learning Objectives

By completing this project, you will:

  • Understand the fundamentals of text classification
  • Learn how to preprocess text data for machine learning
  • Implement feature extraction using TF-IDF vectorization
  • Train and evaluate a Naive Bayes classifier
  • Visualize model performance with confusion matrices

Prerequisites

  • Basic Python programming knowledge
  • Familiarity with pandas and NumPy (helpful but not required)
  • Understanding of basic ML concepts (training/testing split, accuracy)
  • Google account for Colab access

Step-by-Step Guide

1. Environment Setup

Open Google Colab and create a new notebook. The packages you need (pandas, NumPy, and scikit-learn) come pre-installed in Colab, so you can simply import them:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
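
If you're working in a local Jupyter environment instead of Colab, you may need to install the packages first. A typical install command looks like this (matplotlib is included for the confusion-matrix plot later; adjust package versions as needed):

# Run once in a notebook cell (or in a terminal without the "!") if the imports above fail
!pip install pandas numpy scikit-learn matplotlib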

2. Load and Explore the Dataset

Download the Sentiment140 dataset from Kaggle and load it into your notebook:

# Load the dataset (the Kaggle file is typically named
# 'training.1600000.processed.noemoticon.csv'; rename it or adjust the path)
df = pd.read_csv('training.csv', encoding='latin-1', header=None)
df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']

# Explore the data (in Sentiment140, target is 0 = negative, 4 = positive)
print(df.head())
print(f"Dataset shape: {df.shape}")
print(f"Sentiment distribution:\n{df['target'].value_counts()}")

3. Data Preprocessing

Clean and prepare the text data:

import re
import string

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text

df['clean_text'] = df['text'].apply(clean_text)
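
A quick sanity check on a made-up tweet helps confirm the cleaning does what you expect (the example and the output shown in the comment are illustrative only):

sample = "Loving the new #AI course from @colgate! https://example.com"
print(clean_text(sample))  # roughly: "loving the new  course from" (extra spaces remain)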

4. Train-Test Split

X = df['clean_text']
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
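
If you want the train and test sets to preserve the overall positive/negative ratio, scikit-learn's stratify option is an optional drop-in addition to the split above:

# Stratified variant of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)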

5. Feature Extraction with TF-IDF

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
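
It can help to peek at what the vectorizer learned. The shape of the feature matrix and a few vocabulary entries (unigrams and bigrams) are easy to inspect; get_feature_names_out is available in scikit-learn 1.0 and later (older releases used get_feature_names):

print(X_train_tfidf.shape)                      # (number of training tweets, 5000 features)
print(vectorizer.get_feature_names_out()[:20])  # a sample of learned unigram/bigram terms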

6. Train the Classifier

classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
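
If you'd like a more robust estimate than a single train/test split, k-fold cross-validation on the training features is one option (it refits the model several times, so it takes longer):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on the training set
scores = cross_val_score(MultinomialNB(), X_train_tfidf, y_train, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")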

7. Evaluate Performance

y_pred = classifier.predict(X_test_tfidf)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
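
The learning objectives mention confusion matrices; one minimal way to plot one is scikit-learn's ConfusionMatrixDisplay (available in recent scikit-learn versions) together with matplotlib:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Rows are true labels, columns are predicted labels
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title("Sentiment classifier confusion matrix")
plt.show()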

8. Test with Custom Tweets

def predict_sentiment(text):
    clean = clean_text(text)
    vec = vectorizer.transform([clean])
    prediction = classifier.predict(vec)[0]
    # Sentiment140 uses 4 for positive and 0 for negative
    return "Positive" if prediction == 4 else "Negative"

# Try it out!
test_tweets = [
    "I love this new AI project!",
    "This is the worst experience ever.",
    "The weather is nice today."
]

for tweet in test_tweets:
    print(f"Tweet: {tweet}")
    print(f"Sentiment: {predict_sentiment(tweet)}\n")

Expected Outcomes

  • A trained sentiment classifier with 75-80% accuracy
  • Understanding of text preprocessing techniques
  • Experience with scikit-learn's ML pipeline
  • A reusable prediction function for new tweets

Next Steps

  • Try different classifiers such as Logistic Regression or SVM (a Logistic Regression sketch follows this list)
  • Experiment with word embeddings (Word2Vec, GloVe)
  • Add neutral sentiment classification
  • Deploy your model as a web API using Flask
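
As a starting point for the first item above, Logistic Regression is close to a drop-in replacement for Naive Bayes on the same TF-IDF features (an untuned sketch, not a definitive setup):

from sklearn.linear_model import LogisticRegression

# Reuse the TF-IDF features; max_iter raised so the solver converges on this data size
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_tfidf, y_train)
print(f"Logistic Regression accuracy: {accuracy_score(y_test, lr.predict(X_test_tfidf)):.4f}")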

Troubleshooting

Issue: Low accuracy

  • Solution: Increase max_features in TfidfVectorizer or try different preprocessing

Issue: Out of memory errors

  • Solution: Use a smaller sample of the dataset or reduce max_features

Issue: Slow training

  • Solution: Train on a smaller random sample of the dataset (see the optional downsampling step in Section 2) or reduce max_features. Note that scikit-learn runs on the CPU, so switching the Colab runtime to GPU will not speed up training.