
Exploratory Data Analysis (EDA) is an important step in the workflow of any Data Science project. However, when working with text data in a Natural Language Processing (NLP) project, you need to apply different techniques than you would for, e.g., tabular data.
Therefore, we will look at some fundamental EDA techniques for text data in this article:
- Counts and Lengths: We’ll look at character, word, sentence, and string counts, as well as average word and sentence lengths.
- Term Frequency Analysis: We’ll look at the most frequent words and n-grams and discuss why you don’t need word clouds.
For this article, we will use the Women’s E-Commerce Clothing Reviews Dataset from Kaggle.
To simplify the examples, we will use 450 positive reviews (rating == 5) and 450 negative reviews (rating == 1). This reduces the number of data points to 900 rows, reduces the number of rating classes to two, and balances the positive and negative reviews.
Additionally, we will only use two columns: the review text and the rating.
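If you want to reproduce this reduced dataset, the following is a minimal sketch. The CSV file name and the original column names (Review Text and Rating) are assumptions based on the Kaggle dataset, so adjust them to your local copy:

import pandas as pd

# Load the Kaggle dataset (file name is an assumption)
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

# Keep only the review text and the rating and drop empty reviews
df = df[["Review Text", "Rating"]].dropna()

# Sample 450 positive (rating == 5) and 450 negative (rating == 1) reviews
positive = df[df["Rating"] == 5].sample(450, random_state=42)
negative = df[df["Rating"] == 1].sample(450, random_state=42)

# Combine into one balanced DataFrame with 900 rows and simpler column names
df = pd.concat([positive, negative]).reset_index(drop=True)
df.columns = ["text", "rating"]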
The DataFrame’s head of the reduced dataset looks like this:

Counts and Lengths
Let’s start with some basic counts and lengths. We’ll go through each feature with a simple example review text:
text = "Very #comfortable and #versatile. Got lots of compliments."Character Count
You can start by counting all the characters in a text.
char_count = len(text)

For our example review text, the char_count = 58.
Word Count
Next, you can count all the words in a text.
word_count = len(text.split())

For our example review text, the word_count = 8.
Sentence Count
If you have longer texts, you can also count the number of sentences. For this, you need to import the sent_tokenize function from NLTK.
import nltk
from nltk.tokenize import sent_tokenize
# you may need to run nltk.download("punkt") once before using sent_tokenize

sent_count = len(sent_tokenize(text))

For our example review text, the sent_count = 2.
String Counts
You can also count specific characters or strings. E.g., you could count the number of hashtags by counting the character “#”. You could also count mentions (“@”), websites (“http”), and so on.
hashtag_count = text.count("#")

For our example review text, the hashtag_count = 2.
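The same pattern works for the mentions and websites mentioned above; a minimal sketch (the variable names are just illustrative):

mention_count = text.count("@")
url_count = text.count("http")

For our example review text, both counts are 0, since it contains neither mentions nor links.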
Average Word Length
You can also compute the average word length.
import numpy as np

avg_word_len = np.mean([len(w) for w in str(text).split()])

For our example review text, the avg_word_len = 6.375.
Average Sentence Length
If you have longer texts, you can also compute the average sentence length (measured in words).
avg_sent_len = np.mean([len(s.split()) for s in sent_tokenize(text)])

For our example review text, the avg_sent_len = 4.0.
With the map() function, you can apply all of the above techniques to the text column in your pandas DataFrame:
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize

# Character counts
df["char_count"] = df["text"].map(lambda x: len(x))

# Word counts
df["word_count"] = df["text"].map(lambda x: len(x.split()))

# Sentence counts
df["sent_count"] = df["text"].map(lambda x: len(sent_tokenize(x)))

# String counts
df["hashtag_count"] = df["text"].map(lambda x: x.count("#"))

# Average word length
df["avg_word_len"] = df["text"].map(lambda x: np.mean([len(w) for w in str(x).split()]))

# Average sentence length
df["avg_sent_len"] = df["text"].map(lambda x: np.mean([len(s.split()) for s in sent_tokenize(x)]))

After this, your initial DataFrame could look something like this:

Now, you can explore these new features with, e.g., histograms, KDE plots, or boxplots. If you are working on a text classification problem, you can also differentiate by class when you visualize the data.
import seaborn as sns

# feature and target are placeholders for column names, e.g. "word_count" and "rating"

# Histogram
sns.histplot(data = df, x = feature, hue = target)

# KDE plot
sns.kdeplot(data = df, x = feature, hue = target)

# Boxplot
sns.boxplot(data = df, x = target, y = feature)

Below you can see the KDE plots for our example.

Term Frequency Analysis
This is the part where you might be tempted to use a word cloud. Don’t.
I have dedicated a whole section to why I would avoid word clouds, right after this one. But first, let’s talk about how to explore and visualize the most frequent terms.
Before we begin, we need to preprocess the text by converting everything to lowercase and removing all punctuation and non-Roman characters.
import re
import string

def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # Remove non-Roman characters
    text = re.sub(r"([^\x00-\x7F])+", " ", text)
    return text

df["text_clean"] = df["text"].map(lambda x: clean_text(x))

The cleaned review texts look like this:

Most Frequent Words
To get the most frequent words, you first need to create a so-called “corpus”. That means we create a list containing all relevant words from the cleaned review texts. By “relevant” words, I mean words that aren’t stopwords, e.g., “is”, “for”, “a”, and “and”.
from nltk.corpus import stopwords
# you may need to run nltk.download("stopwords") once

stop = set(stopwords.words("english"))

corpus = [word for tokens in df["text_clean"].str.split().values.tolist() for word in tokens if word not in stop]

The corpus looks like this:

Now, to get the most common words from the corpus, you have two options:
You can either use the FreqDist class:
from nltk.probability import FreqDist
most_common = FreqDist(corpus).most_common(10)

Or you can use the Counter class:
from collections import Counter
most_common = Counter(corpus).most_common(10)

With the most_common(10) function, the top 10 most common words and their frequencies will be returned.

With this, you can easily create a barplot of the most common words:
words, frequency = [], []
for word, count in most_common:
    words.append(word)
    frequency.append(count)

sns.barplot(x = frequency, y = words)

Below, you can see the top 10 most common words for the negative and positive reviews:

From this technique, we can see that in both positive and negative reviews, items like “dress” and “top” are mentioned most often. However, in the positive reviews, adjectives like “great” and “perfect” appear frequently, which is not the case in the negative reviews.
Most Frequent N-Grams
Let’s do the same for n-grams. What’s an n-gram? It’s a sequence of n words in a text.
E.g., the bi-grams (n = 2) for the sentence “How are you today?” would be: “How are”, “are you”, and “you today”. The tri-grams (n = 3) would be “How are you” and “are you today”.
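If you want to see this for yourself, here is a minimal sketch using NLTK’s ngrams helper (the example is tokenized with a simple split, so the trailing “?” stays attached to the last word):

from nltk.util import ngrams

sentence = "How are you today?"

# Build the bi-grams (n = 2) from the whitespace-tokenized sentence
bigrams = [" ".join(gram) for gram in ngrams(sentence.split(), 2)]
# ['How are', 'are you', 'you today?']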
To separate the texts into n-grams, you can use the CountVectorizer class as shown below.
With ngram_range, you can define which n-grams to consider. E.g., ngram_range = (2, 2) only considers bi-grams, ngram_range = (3, 3) only considers tri-grams, and ngram_range = (2, 3) considers both bi-grams and tri-grams.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with the stopwords from above
vec = CountVectorizer(stop_words = list(stop), ngram_range = (2, 2))

# Matrix of ngrams
bow = vec.fit_transform(df["text_clean"])

# Count frequency of ngrams
count_values = bow.toarray().sum(axis=0)

# Create DataFrame from ngram frequencies
ngram_freq = pd.DataFrame(sorted([(count_values[i], k) for k, i in vec.vocabulary_.items()], reverse = True))
ngram_freq.columns = ["frequency", "ngram"]

The resulting DataFrame ngram_freq looks like this:

Below you can see the bi- and tri-grams separated by positive and negative reviews:


As you can see, this EDA technique helps you understand the different tones of the reviews.
Miss Me With Those Word Clouds

Notice how we did not talk about word clouds? Yes, you can actually perform an EDA on text data without using a word cloud.
Word clouds are confusing. While more frequent terms are displayed in a larger font size than less frequent terms, it is difficult to grasp the order among similarly frequent words.
A simple bar chart may not be as flashy as a word cloud, but it does a much better job of visualizing each term’s exact rank and frequency.
So, unless your management really wants to see a word cloud, I would recommend using a bar chart instead.
Conclusion
In this article, we looked at some fundamental EDA techniques for text data:
- Counts and Lengths: We looked at character, word, sentence, and string counts, as well as average word and sentence lengths.
- Term Frequency Analysis: We looked at the most frequent words and n-grams and discussed why you don’t need word clouds.
You can find the continuation of this article in “Intermediate EDA Techniques for NLP”.
This blog was originally published on Towards Data Science on Aug 31, 2022 and moved to this site on Feb 1, 2026.