Twitter Sentiment Analysis Classifier
Build a machine learning model to classify tweets as positive or negative using Python and scikit-learn, with neutral classification as a suggested extension.
Tools & Technologies
Before starting, make sure you have:
- A Python 3 installation (or a Google Colab account)
- scikit-learn
- pandas
- Jupyter Notebook (not needed if you use Colab)
Project Details
Project Overview
Sentiment analysis is one of the most popular applications of Natural Language Processing (NLP). In this beginner-friendly project, you'll build a machine learning classifier that can automatically determine whether a tweet expresses positive or negative sentiment; extending it to handle neutral tweets is suggested under Next Steps.
Learning Objectives
By completing this project, you will:
- Understand the fundamentals of text classification
- Learn how to preprocess text data for machine learning
- Implement feature extraction using TF-IDF vectorization
- Train and evaluate a Naive Bayes classifier
- Visualize model performance with confusion matrices
Prerequisites
- Basic Python programming knowledge
- Familiarity with pandas and NumPy (helpful but not required)
- Understanding of basic ML concepts (training/testing split, accuracy)
- Google account for Colab access
Step-by-Step Guide
1. Environment Setup
Open Google Colab and create a new notebook. Colab comes with all of the required packages preinstalled, so you only need to import them:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
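If you run the notebook locally instead of in Colab, install the packages first; a typical command (run it in a terminal, or with a leading ! in a notebook cell) is:
pip install pandas numpy scikit-learn matplotlib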
2. Load and Explore the Dataset
Download the Sentiment140 dataset from Kaggle and load it into your notebook. In this dataset the target column uses 0 for negative and 4 for positive tweets:
# Load the dataset
df = pd.read_csv('training.csv', encoding='latin-1', header=None)
df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']
# Explore the data
print(df.head())
print(f"Dataset shape: {df.shape}")
print(f"Sentiment distribution:\n{df['target'].value_counts()}")
3. Data Preprocessing
Clean and prepare the text data:
import re
import string
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text
df['clean_text'] = df['text'].apply(clean_text)
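A quick sanity check with a made-up example tweet shows what the cleaning does:
# URL, mention, hashtag, and punctuation are stripped; text is lowercased
print(clean_text("Loving this! http://t.co/abc @friend #ML"))
# Prints something like: "loving this" (plus leftover whitespace)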
4. Train-Test Split
X = df['clean_text']
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
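One optional refinement (my suggestion, not part of the original steps): passing stratify=y preserves the positive/negative ratio in both splits:
# Stratified variant of the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)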
5. Feature Extraction with TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
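To get a feel for what the vectorizer learned, you can inspect the matrix shape and a few feature names (get_feature_names_out requires scikit-learn 1.0 or newer):
print(X_train_tfidf.shape)                      # (number of training tweets, 5000)
print(vectorizer.get_feature_names_out()[:10])  # a few learned unigrams/bigrams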
6. Train the Classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
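As an aside, the vectorizer and classifier can be bundled into a single scikit-learn Pipeline; this sketch is equivalent to steps 5-6 and avoids accidentally fitting the vectorizer on test data:
from sklearn.pipeline import Pipeline

# Same model expressed as one estimator: raw text in, labels out
sentiment_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('nb', MultinomialNB()),
])
sentiment_pipe.fit(X_train, y_train)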
7. Evaluate Performance
y_pred = classifier.predict(X_test_tfidf)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
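The learning objectives mention visualizing performance with a confusion matrix; a minimal way to do that (requires matplotlib) is:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are true labels (0 = negative, 4 = positive), columns are predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()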
8. Test with Custom Tweets
def predict_sentiment(text):
    clean = clean_text(text)
    vec = vectorizer.transform([clean])
    prediction = classifier.predict(vec)[0]
    return "Positive" if prediction == 4 else "Negative"
# Try it out!
test_tweets = [
"I love this new AI project!",
"This is the worst experience ever.",
"The weather is nice today."
]
for tweet in test_tweets:
    print(f"Tweet: {tweet}")
    print(f"Sentiment: {predict_sentiment(tweet)}\n")
Expected Outcomes
- A trained sentiment classifier with 75-80% accuracy
- Understanding of text preprocessing techniques
- Experience with scikit-learn's ML pipeline
- A reusable prediction function for new tweets
Next Steps
- Try different classifiers (Logistic Regression, SVM)
- Experiment with word embeddings (Word2Vec, GloVe)
- Add neutral sentiment classification
- Deploy your model as a web API using Flask (see the sketch below)
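For the Flask idea in the last bullet, here is a minimal sketch as a starting point; the file name, route, and pickling step are illustrative assumptions, not part of the original project:
# Run once after training: save the fitted vectorizer and classifier
import pickle
with open('sentiment_model.pkl', 'wb') as f:
    pickle.dump({'vectorizer': vectorizer, 'classifier': classifier}, f)

# app.py -- minimal Flask API (hypothetical sketch)
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open('sentiment_model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    # In practice, apply the same clean_text preprocessing used in training
    text = request.get_json().get('text', '')
    vec = model['vectorizer'].transform([text])
    label = model['classifier'].predict(vec)[0]
    return jsonify({'sentiment': 'Positive' if label == 4 else 'Negative'})

if __name__ == '__main__':
    app.run(debug=True)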
Troubleshooting
Issue: Low accuracy
- Solution: Increase max_features in TfidfVectorizer or try different preprocessing
Issue: Out of memory errors
- Solution: Use a smaller sample of the dataset or reduce max_features
Issue: Slow training
- Solution: Train on a smaller random sample of the dataset; note that scikit-learn runs on the CPU, so enabling Colab's GPU runtime will not speed it up