Linear Text Classification Examples and What Is TF-IDF?

Linear Text Classification

Practical Example of Linear Text Classification

Let’s look at a straightforward real-world example of using logistic regression to categorise movie reviews as either positive or negative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Example movie reviews
reviews = ["I love this movie", "This was a terrible movie",
           "Absolutely fantastic!", "Not great, very boring"]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative
# Convert text to Bag of Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)
# Train a Logistic Regression classifier
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(predictions)

In this straightforward example, a few movie reviews are converted into a numerical format using Bag of Words and then classified as positive or negative using logistic regression.
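
Because logistic regression is a linear model, each vocabulary word receives a single learned weight, and inspecting those weights shows which words push a review towards the positive or the negative class. A minimal sketch, assuming the model and vectorizer fitted in the example above:

# Inspect the learned weights (assumes `model` and `vectorizer` from above)
feature_names = vectorizer.get_feature_names_out()
weights = model.coef_[0]  # one weight per vocabulary word
# Sort from most negative (negative-class words) to most positive
for word, weight in sorted(zip(feature_names, weights), key=lambda pair: pair[1]):
    print(f"{word}: {weight:+.3f}")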

Although Bag of Words is a straightforward and popular approach, there are other, more sophisticated methods for representing text in Natural Language Processing. One of these is TF-IDF (Term Frequency-Inverse Document Frequency), a variation of BoW that gives more weight to uncommon terms than to common ones.
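
The difference is easy to see on a tiny corpus: with raw counts every occurrence is weighted equally, while TF-IDF assigns a comparatively lower weight to a word that appears in every document. A minimal sketch (the three example texts are illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["great movie", "terrible movie", "boring movie"]
# Bag of Words: every occurrence counts the same
print(CountVectorizer().fit_transform(texts).toarray())
# TF-IDF: "movie" appears in all three texts, so it gets a lower weight
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(texts).toarray())
print(tfidf.get_feature_names_out())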

What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in Natural Language Processing to assess a word’s importance in a document relative to a corpus, or collection of documents. By emphasising terms that are frequent in a particular document but rare across the rest of the corpus, TF-IDF highlights the words that are most distinctive to that document.

Components of TF-IDF

Term Frequency (TF) measures how often a term occurs in a document. In its simplest form, it is the raw count of the term divided by the total number of terms in the document.

TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)

Inverse Document Frequency (IDF) reduces the weight of terms that occur across many documents. A word that appears in most of the documents in a corpus is considered less informative.

IDF(t) = log(N / n_t)

where N is the total number of documents in the corpus and n_t is the number of documents that contain the term t. The worked example below uses a base-10 logarithm.

TF-IDF Score

The TF and IDF values of a word in a document are multiplied to determine its final TF-IDF score:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Higher TF-IDF scores will be assigned to words that appear frequently in one document but infrequently in many others.
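
To make the three components concrete, here is a minimal pure-Python sketch of the same formulas (using a base-10 logarithm, as in the worked example below; a production implementation such as scikit-learn’s adds smoothing so a term missing from the corpus does not cause a division by zero):

import math

def tf(term, document):
    words = document.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Number of documents in the corpus that contain the term
    n_t = sum(1 for document in corpus if term in document.lower().split())
    return math.log10(len(corpus) / n_t)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)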

Example of a TF-IDF Calculation

Assume your corpus contains three short documents, for example:

Document 1: “I love machine learning”
Document 2: “Deep learning is powerful”
Document 3: “Machine translation is useful”

Objective

We want to determine the TF-IDF score for the word “machine” in the first document.

Step-by-Step Calculation

Step 1: Calculate Term Frequency (TF)

The term “machine” appears once out of four words in the first document. Thus:

TF(“machine”, Document 1) = 1 / 4 = 0.25

Step 2: Calculate Inverse Document Frequency (IDF)

Two of the three documents (documents 1 and 3) contain the word “machine.” Thus:

IDF(“machine”) = log10(3 / 2) ≈ 0.176

Step 3: Calculate TF-IDF

Multiplying TF by IDF gives:

TF-IDF(“machine”, Document 1) = 0.25 × 0.176 ≈ 0.044

Therefore, “machine” in Document 1 has a TF-IDF score of approximately 0.044.
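
This arithmetic is easy to verify directly in Python:

import math

tf = 1 / 4                 # "machine" appears once in a four-word document
idf = math.log10(3 / 2)    # 3 documents in the corpus, 2 contain "machine"
print(round(tf * idf, 3))  # 0.044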

Practical Example in Python

In Python, the TF-IDF idea is implemented with the scikit-learn library’s TfidfVectorizer. A detailed walkthrough of TF-IDF implementation and linear text classification is provided below.

Walkthrough of the Code

Import Required Libraries: To perform TF-IDF vectorisation and apply a machine learning model such as Naive Bayes, we require a number of libraries.

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import train_test_split 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.metrics import accuracy_score, classification_report

Prepare Sample Data: Define a small labelled dataset (replace it with your own).

documents = ["This is a positive document.", "Negative sentiment detected.",
             "Another positive example.", "Negative review here."]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative

Split the Data: Divide the data into training and testing sets in order to train and then evaluate the model. The train_test_split function splits the dataset automatically.

X_train, X_test, y_train, y_test = train_test_split(documents, labels, 
test_size=0.2, random_state=42)

TF-IDF Vectorization: Next, use TF-IDF to transform the text data into numerical format.

vectorizer = TfidfVectorizer() 
X_train_tfidf = vectorizer.fit_transform(X_train) 
X_test_tfidf = vectorizer.transform(X_test)
  • Applying fit_transform to the training data learns the vocabulary and performs the TF-IDF transformation.
  • Applying transform to the test data reuses the same vocabulary learned from the training data, as the sketch below shows.
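
To see what the vectorizer has produced, you can print the learned vocabulary and the shapes of the resulting matrices. A minimal sketch, assuming the variables defined above:

# Peek at the fitted vectorizer (assumes `vectorizer`, `X_train_tfidf`, `X_test_tfidf` from above)
print(vectorizer.get_feature_names_out())  # vocabulary learned from the training data
print(X_train_tfidf.shape)                 # (number of training documents, vocabulary size)
print(X_test_tfidf.shape)                  # same vocabulary size as the training matrix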

Train the Naive Bayes Classifier: We use the Multinomial Naive Bayes algorithm, which is frequently used for linear text classification. You can substitute any classifier you like, such as logistic regression, an SVM, or an MLP.

classifier = MultinomialNB() 
classifier.fit(X_train_tfidf, y_train)

Make Predictions: After training, use the model to predict the labels of the test data.

predictions = classifier.predict(X_test_tfidf)

Evaluate the Model: Use the accuracy score and the classification report to assess how well the classifier performed.

accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)

Practical Example Summary

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Sample data
documents = [
    "This is a positive document.",
    "Negative sentiment detected.",
    "Another positive example.",
    "Negative review here."
]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2, random_state=42)
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Train the Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
# Make predictions
predictions = classifier.predict(X_test_tfidf)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)