Linear Text Classification Examples and What Is TF-IDF?

Linear Text Classification

Practical Example of Linear Text Classification

Let’s look at a straightforward real-world example of using logistic regression to categorise movie reviews as either positive or negative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Example movie reviews
reviews = ["I love this movie", "This was a terrible movie",
           "Absolutely fantastic!", "Not great, very boring"]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative
# Convert text to Bag of Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)
# Train a Logistic Regression classifier
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(predictions)

In this straightforward example, a few movie reviews are converted into a numerical format using Bag of Words and then classified as positive or negative using logistic regression.
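
Because logistic regression is a linear model, each vocabulary word receives a single learned weight, and inspecting those weights shows which words push a review towards the positive or the negative class. A minimal sketch, assuming the model and vectorizer fitted in the example above:

# Inspect the learned weights (assumes `model` and `vectorizer` from above)
feature_names = vectorizer.get_feature_names_out()
weights = model.coef_[0]  # one weight per vocabulary word
# Sort from most negative (negative-class words) to most positive
for word, weight in sorted(zip(feature_names, weights), key=lambda pair: pair[1]):
    print(f"{word}: {weight:+.3f}")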

Although Bag of Words is a straightforward and popular approach, there are other, more sophisticated methods for representing text in Natural Language Processing. One of these is TF-IDF (Term Frequency-Inverse Document Frequency), a variation of BoW that gives more weight to uncommon terms than to common ones.
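
The difference is easy to see on a tiny corpus: with raw counts every occurrence is weighted equally, while TF-IDF assigns a comparatively lower weight to a word that appears in every document. A minimal sketch (the three example texts are illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["great movie", "terrible movie", "boring movie"]
# Bag of Words: every occurrence counts the same
print(CountVectorizer().fit_transform(texts).toarray())
# TF-IDF: "movie" appears in all three texts, so it gets a lower weight
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(texts).toarray())
print(tfidf.get_feature_names_out())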

What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in Natural Language Processing to assess a word’s importance in a document relative to a corpus, or collection of documents. By emphasising terms that are frequent in a particular document but rare across the rest of the corpus, TF-IDF highlights the words that are most distinctive to that document.

Components of TF-IDF

Term Frequency (TF) measures how often a term occurs in a document. In its simplest form, it is the raw count of the term divided by the total number of terms in the document.

TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)

Inverse Document Frequency (IDF) reduces the weight of terms that occur across many documents. A word that appears in most of the documents in a corpus is considered less informative.

IDF(t) = log(N / n_t)

where N is the total number of documents in the corpus and n_t is the number of documents that contain the term t. The worked example below uses a base-10 logarithm.

TF-IDF Score

The TF and IDF values of a word in a document are multiplied to determine its final TF-IDF score:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Higher TF-IDF scores will be assigned to words that appear frequently in one document but infrequently in many others.
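
To make the three components concrete, here is a minimal pure-Python sketch of the same formulas (using a base-10 logarithm, as in the worked example below; a production implementation such as scikit-learn’s adds smoothing so a term missing from the corpus does not cause a division by zero):

import math

def tf(term, document):
    words = document.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Number of documents in the corpus that contain the term
    n_t = sum(1 for document in corpus if term in document.lower().split())
    return math.log10(len(corpus) / n_t)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)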

Example of a TF-IDF Calculation

Assume your corpus contains three short documents, for example:

Document 1: “I love machine learning”
Document 2: “Deep learning is powerful”
Document 3: “Machine translation is useful”

Objective

We want to determine the TF-IDF score for the word “machine” in the first document.

Step-by-Step Calculation

Step 1: Calculate Term Frequency (TF)

The term “machine” appears once out of four words in the first document. Thus:

TF(“machine”, Document 1) = 1 / 4 = 0.25

Step 2: Calculate Inverse Document Frequency (IDF)

Two of the three documents (documents 1 and 3) contain the word “machine.” Thus:

IDF(“machine”) = log10(3 / 2) ≈ 0.176

Step 3: Calculate TF-IDF

Multiplying TF by IDF gives:

TF-IDF(“machine”, Document 1) = 0.25 × 0.176 ≈ 0.044

Therefore, “machine” in Document 1 has a TF-IDF score of approximately 0.044.
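
This arithmetic is easy to verify directly in Python:

import math

tf = 1 / 4                 # "machine" appears once in a four-word document
idf = math.log10(3 / 2)    # 3 documents in the corpus, 2 contain "machine"
print(round(tf * idf, 3))  # 0.044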

Practical Example in Python

In Python, the TF-IDF idea is implemented with the scikit-learn library’s TfidfVectorizer. A detailed walkthrough of TF-IDF implementation and linear text classification is provided below.

Walkthrough of the Code

Import Required Libraries: To perform TF-IDF vectorisation and apply a machine learning model such as Naive Bayes, we require a number of libraries.

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import train_test_split 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.metrics import accuracy_score, classification_report

Prepare Sample Data: Define a small labelled dataset (replace it with your own).

documents = ["This is a positive document.", "Negative sentiment detected.",
             "Another positive example.", "Negative review here."]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative

Split the Data: Divide the data into training and testing sets in order to train and then evaluate the model. The train_test_split function splits the dataset automatically.

X_train, X_test, y_train, y_test = train_test_split(documents, labels, 
test_size=0.2, random_state=42)

TF-IDF Vectorization: Next, use TF-IDF to transform the text data into numerical format.

vectorizer = TfidfVectorizer() 
X_train_tfidf = vectorizer.fit_transform(X_train) 
X_test_tfidf = vectorizer.transform(X_test)
  • Applying fit_transform to the training data learns the vocabulary and performs the TF-IDF transformation.
  • Applying transform to the test data reuses the same vocabulary learned from the training data, as the sketch below shows.
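
To see what the vectorizer has produced, you can print the learned vocabulary and the shapes of the resulting matrices. A minimal sketch, assuming the variables defined above:

# Peek at the fitted vectorizer (assumes `vectorizer`, `X_train_tfidf`, `X_test_tfidf` from above)
print(vectorizer.get_feature_names_out())  # vocabulary learned from the training data
print(X_train_tfidf.shape)                 # (number of training documents, vocabulary size)
print(X_test_tfidf.shape)                  # same vocabulary size as the training matrix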

Train the Naive Bayes Classifier: We use the Multinomial Naive Bayes algorithm, which is frequently used for linear text classification. You can substitute any classifier you like, such as logistic regression, an SVM, or an MLP.

classifier = MultinomialNB() 
classifier.fit(X_train_tfidf, y_train)

Make Predictions: After training, use the model to predict the labels of the test data.

predictions = classifier.predict(X_test_tfidf)

Evaluate the Model: Use the accuracy score and the classification report to assess how well the classifier performed.

accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)

Practical Example Summary

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Sample data
documents = [
    "This is a positive document.",
    "Negative sentiment detected.",
    "Another positive example.",
    "Negative review here."
]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2, random_state=42)
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Train the Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
# Make predictions
predictions = classifier.predict(X_test_tfidf)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)