Twitter Sentiment Analysis Classifier
Build a machine learning model to classify tweets as positive or negative using Python and scikit-learn, with neutral classification as a suggested extension.
Tools & Technologies
Before starting, make sure you have:
- A Python 3 installation (or a Google Colab account)
- scikit-learn
- pandas
- Jupyter Notebook (not needed if you use Colab)
Project Details
Project Overview
Sentiment analysis is one of the most popular applications of Natural Language Processing (NLP). In this beginner-friendly project, you'll build a machine learning classifier that can automatically determine whether a tweet expresses positive or negative sentiment; extending it to handle neutral tweets is suggested under Next Steps.
Learning Objectives
By completing this project, you will:
- Understand the fundamentals of text classification
- Learn how to preprocess text data for machine learning
- Implement feature extraction using TF-IDF vectorization
- Train and evaluate a Naive Bayes classifier
- Visualize model performance with confusion matrices
Prerequisites
- Basic Python programming knowledge
- Familiarity with pandas and NumPy (helpful but not required)
- Understanding of basic ML concepts (training/testing split, accuracy)
- Google account for Colab access
Step-by-Step Guide
1. Environment Setup
Open Google Colab and create a new notebook. Colab comes with all of the required packages preinstalled, so you only need to import them:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
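If you run the notebook locally instead of in Colab, install the packages first; a typical command (run it in a terminal, or with a leading ! in a notebook cell) is:
pip install pandas numpy scikit-learn matplotlib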
2. Load and Explore the Dataset
Download the Sentiment140 dataset from Kaggle and load it into your notebook. In this dataset the target column uses 0 for negative and 4 for positive tweets:
# Load the dataset
df = pd.read_csv('training.csv', encoding='latin-1', header=None)
df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']
# Explore the data
print(df.head())
print(f"Dataset shape: {df.shape}")
print(f"Sentiment distribution:\n{df['target'].value_counts()}")
3. Data Preprocessing
Clean and prepare the text data:
import re
import string
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text
df['clean_text'] = df['text'].apply(clean_text)
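A quick sanity check with a made-up example tweet shows what the cleaning does:
# URL, mention, hashtag, and punctuation are stripped; text is lowercased
print(clean_text("Loving this! http://t.co/abc @friend #ML"))
# Prints something like: "loving this" (plus leftover whitespace)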
4. Train-Test Split
X = df['clean_text']
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
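One optional refinement (my suggestion, not part of the original steps): passing stratify=y preserves the positive/negative ratio in both splits:
# Stratified variant of the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)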
5. Feature Extraction with TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
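To get a feel for what the vectorizer learned, you can inspect the matrix shape and a few feature names (get_feature_names_out requires scikit-learn 1.0 or newer):
print(X_train_tfidf.shape)                      # (number of training tweets, 5000)
print(vectorizer.get_feature_names_out()[:10])  # a few learned unigrams/bigrams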
6. Train the Classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
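As an aside, the vectorizer and classifier can be bundled into a single scikit-learn Pipeline; this sketch is equivalent to steps 5-6 and avoids accidentally fitting the vectorizer on test data:
from sklearn.pipeline import Pipeline

# Same model expressed as one estimator: raw text in, labels out
sentiment_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('nb', MultinomialNB()),
])
sentiment_pipe.fit(X_train, y_train)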
7. Evaluate Performance
y_pred = classifier.predict(X_test_tfidf)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
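The learning objectives mention visualizing performance with a confusion matrix; a minimal way to do that (requires matplotlib) is:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are true labels (0 = negative, 4 = positive), columns are predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()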
8. Test with Custom Tweets
def predict_sentiment(text):
    clean = clean_text(text)
    vec = vectorizer.transform([clean])
    prediction = classifier.predict(vec)[0]
    return "Positive" if prediction == 4 else "Negative"
# Try it out!
test_tweets = [
"I love this new AI project!",
"This is the worst experience ever.",
"The weather is nice today."
]
for tweet in test_tweets:
    print(f"Tweet: {tweet}")
    print(f"Sentiment: {predict_sentiment(tweet)}\n")
Expected Outcomes
- A trained sentiment classifier with 75-80% accuracy
- Understanding of text preprocessing techniques
- Experience with scikit-learn's ML pipeline
- A reusable prediction function for new tweets
Next Steps
- Try different classifiers (Logistic Regression, SVM)
- Experiment with word embeddings (Word2Vec, GloVe)
- Add neutral sentiment classification
- Deploy your model as a web API using Flask (see the sketch below)
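For the Flask idea in the last bullet, here is a minimal sketch as a starting point; the file name, route, and pickling step are illustrative assumptions, not part of the original project:
# Run once after training: save the fitted vectorizer and classifier
import pickle
with open('sentiment_model.pkl', 'wb') as f:
    pickle.dump({'vectorizer': vectorizer, 'classifier': classifier}, f)

# app.py -- minimal Flask API (hypothetical sketch)
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open('sentiment_model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    # In practice, apply the same clean_text preprocessing used in training
    text = request.get_json().get('text', '')
    vec = model['vectorizer'].transform([text])
    label = model['classifier'].predict(vec)[0]
    return jsonify({'sentiment': 'Positive' if label == 4 else 'Negative'})

if __name__ == '__main__':
    app.run(debug=True)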
Troubleshooting
Issue: Low accuracy
- Solution: Increase max_features in TfidfVectorizer or try different preprocessing
Issue: Out of memory errors
- Solution: Use a smaller sample of the dataset or reduce max_features
Issue: Slow training
- Solution: Train on a smaller random sample of the dataset; note that scikit-learn runs on the CPU, so enabling Colab's GPU runtime will not speed it up