CountVectorizer to Extract Features from Texts in Python, in Detail | by Rashida Nasrin Sucky

Photo by Towfiqu barbhuiya on Unsplash

Everything you need to know to use CountVectorizer efficiently in Sklearn

Rashida Nasrin Sucky

The most basic data processing that any Natural Language Processing (NLP) project requires is to convert the text data to the numeric data. As long as the data is in text form we cannot do any kind of computation action on it.

There are multiple methods available for this text-to-numeric data conversion. This tutorial will explain one of the most basic vectorizers, the CountVectorizer method in the scikit-learn library.

This method is very simple. It takes the frequency of occurrence of each word as the numeric value. An example will make it clear.

In the following code block:

  • We will import the CountVectorizer method.
  • Call the method.
  • Fit the text data to the CountVectorizer method and, convert that to an array.
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer

#This is the text to be vectorized
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt.\
I am trying to learn how to use count vectorizer."]

cv= CountVectorizer()
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()


array([[1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]],

Here I have the numeric values representing the text data above.

How do we know which values represent which words in the text?

To make that clear, it will be helpful to convert the array into a DataFrame where column names will be the words themselves.

cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names())

Now, it shows clearly. The value of the word ‘also’ is 1 which means ‘also’ appeared only once in the test. The word ‘aunt’ came twice in the text. So, the value of the word ‘aunt’ is 2.

In the last example, all the sentences were in one string. So, we got only one row of data for four sentences. Let’s rearrange the text and…

Source link