
Exploratory Data Analysis (EDA) is an important step in the workflow of any Data Science project. However, when working with text data in a Natural Language Processing (NLP) project, you need to apply different techniques than you would for, e.g., tabular data.
Therefore, we will look at some fundamental EDA techniques for text data in this article:
- Counts and Lengths: We’ll look at character, word, sentence, and string counts, as well as average word and sentence lengths.
- Term Frequency Analysis: We’ll look at the most frequent words and n-grams and discuss why you don’t need word clouds.
For this article, we will use the Women’s E-Commerce Clothing Reviews Dataset from Kaggle.
To simplify the examples, we will use 450 positive reviews (rating == 5) and 450 negative reviews (rating == 1). This reduces the number of data points to 900 rows, reduces the number of rating classes to two, and balances the positive and negative reviews.
Additionally, we will only use two columns: the review text and the rating.
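If you want to reproduce this reduced dataset, the following is a minimal sketch. The CSV file name and the original column names (Review Text and Rating) are assumptions based on the Kaggle dataset, so adjust them to your local copy:

import pandas as pd

# Load the Kaggle dataset (file name is an assumption)
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

# Keep only the review text and the rating and drop empty reviews
df = df[["Review Text", "Rating"]].dropna()

# Sample 450 positive (rating == 5) and 450 negative (rating == 1) reviews
positive = df[df["Rating"] == 5].sample(450, random_state=42)
negative = df[df["Rating"] == 1].sample(450, random_state=42)

# Combine into one balanced DataFrame with 900 rows and simpler column names
df = pd.concat([positive, negative]).reset_index(drop=True)
df.columns = ["text", "rating"]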
The DataFrame’s head of the reduced dataset looks like this:

Counts and Lengths
Let’s start with some basic counts and lengths. We’ll go through each feature with a simple example review text:
text = "Very #comfortable and #versatile. Got lots of compliments."Character Count
You can start by counting all the characters in a text.
char_count = len(text)

For our example review text, the char_count = 58.
Word Count
Next, you can count all the words in a text.
word_count = len(text.split())

For our example review text, the word_count = 8.
Sentence Count
If you have longer texts, you can also count the number of sentences. For this, you need to import the sent_tokenize function from NLTK.
import nltk
from nltk.tokenize import sent_tokenize
# you may need to run nltk.download("punkt") once before using sent_tokenize

sent_count = len(sent_tokenize(text))

For our example review text, the sent_count = 2.
String Counts
You can also count specific characters or strings. E.g., you could count the number of hashtags by counting the character “#”. You could also count mentions (“@”), websites (“http”), and so on.
hashtag_count = text.count("#")

For our example review text, the hashtag_count = 2.
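The same pattern works for the mentions and websites mentioned above; a minimal sketch (the variable names are just illustrative):

mention_count = text.count("@")
url_count = text.count("http")

For our example review text, both counts are 0, since it contains neither mentions nor links.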
Average Word Length
You can also compute the average word length.
import numpy as np

avg_word_len = np.mean([len(w) for w in str(text).split()])

For our example review text, the avg_word_len = 6.375.
Average Sentence Length
If you have longer texts, you can also compute the average sentence length (measured in words).
avg_sent_len = np.mean([len(s.split()) for s in sent_tokenize(text)])

For our example review text, the avg_sent_len = 4.0.
With the map() function, you can apply all of the above techniques to the text column in your pandas DataFrame:
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize

# Character counts
df["char_count"] = df["text"].map(lambda x: len(x))

# Word counts
df["word_count"] = df["text"].map(lambda x: len(x.split()))

# Sentence counts
df["sent_count"] = df["text"].map(lambda x: len(sent_tokenize(x)))

# String counts
df["hashtag_count"] = df["text"].map(lambda x: x.count("#"))

# Average word length
df["avg_word_len"] = df["text"].map(lambda x: np.mean([len(w) for w in str(x).split()]))

# Average sentence length
df["avg_sent_len"] = df["text"].map(lambda x: np.mean([len(s.split()) for s in sent_tokenize(x)]))

After this, your initial DataFrame could look something like this:

Now, you can explore these new features with, e.g., histograms, KDE plots, or boxplots. If you are working on a text classification problem, you can also differentiate by class when you visualize the data.
import seaborn as sns

# feature and target are placeholders for column names, e.g. "word_count" and "rating"

# Histogram
sns.histplot(data = df, x = feature, hue = target)

# KDE plot
sns.kdeplot(data = df, x = feature, hue = target)

# Boxplot
sns.boxplot(data = df, x = target, y = feature)

Below you can see the KDE plots for our example.

Term Frequency Analysis
This is the part where you might be tempted to use a word cloud. Don’t.
I have dedicated a whole section to why I would avoid word clouds, right after this one. But first, let’s talk about how to explore and visualize the most frequent terms.
Before we begin, we need to preprocess the text by converting everything to lowercase and removing all punctuation and non-Roman characters.
import re
import string

def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # Remove non-Roman characters
    text = re.sub(r"([^\x00-\x7F])+", " ", text)
    return text

df["text_clean"] = df["text"].map(lambda x: clean_text(x))

The cleaned review texts look like this:

Most Frequent Words
To get the most frequent words, you first need to create a so-called “corpus”. That means we create a list containing all relevant words from the cleaned review texts. By “relevant” words, I mean words that aren’t stopwords, e.g., “is”, “for”, “a”, and “and”.
from nltk.corpus import stopwords
# you may need to run nltk.download("stopwords") once

stop = set(stopwords.words("english"))

corpus = [word for tokens in df["text_clean"].str.split().values.tolist() for word in tokens if word not in stop]

The corpus looks like this:

Now, to get the most common words from the corpus, you have two options:
You can either use the FreqDist class:
from nltk.probability import FreqDist
most_common = FreqDist(corpus).most_common(10)

Or you can use the Counter class:
from collections import Counter
most_common = Counter(corpus).most_common(10)

With the most_common(10) function, the top 10 most common words and their frequencies will be returned.

With this, you can easily create a barplot of the most common words:
words, frequency = [], []
for word, count in most_common:
    words.append(word)
    frequency.append(count)

sns.barplot(x = frequency, y = words)

Below, you can see the top 10 most common words for the negative and positive reviews:

From this technique, we can see that in both positive and negative reviews, items like “dress” and “top” are mentioned most often. However, in the positive reviews, adjectives like “great” and “perfect” appear frequently, which is not the case in the negative reviews.
Most Frequent N-Grams
Let’s do the same for n-grams. What’s an n-gram? It’s a sequence of n words in a text.
E.g., the bi-grams (n = 2) for the sentence “How are you today?” would be: “How are”, “are you”, and “you today”. The tri-grams (n = 3) would be “How are you” and “are you today”.
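If you want to see this for yourself, here is a minimal sketch using NLTK’s ngrams helper (the example is tokenized with a simple split, so the trailing “?” stays attached to the last word):

from nltk.util import ngrams

sentence = "How are you today?"

# Build the bi-grams (n = 2) from the whitespace-tokenized sentence
bigrams = [" ".join(gram) for gram in ngrams(sentence.split(), 2)]
# ['How are', 'are you', 'you today?']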
To separate the texts into n-grams, you can use the CountVectorizer class as shown below.
With ngram_range, you can define which n-grams to consider. E.g., ngram_range = (2, 2) only considers bi-grams, ngram_range = (3, 3) only considers tri-grams, and ngram_range = (2, 3) considers both bi-grams and tri-grams.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with the stopwords from above
vec = CountVectorizer(stop_words = list(stop), ngram_range = (2, 2))

# Matrix of ngrams
bow = vec.fit_transform(df["text_clean"])

# Count frequency of ngrams
count_values = bow.toarray().sum(axis=0)

# Create DataFrame from ngram frequencies
ngram_freq = pd.DataFrame(sorted([(count_values[i], k) for k, i in vec.vocabulary_.items()], reverse = True))
ngram_freq.columns = ["frequency", "ngram"]

The resulting DataFrame ngram_freq looks like this:

Below you can see the bi- and tri-grams separated by positive and negative reviews:


As you can see, this EDA technique helps you understand the different tones of the reviews.
Miss Me With Those Word Clouds

Notice how we did not talk about word clouds? Yes, you can actually perform an EDA on text data without using a word cloud.
Word clouds are confusing. While more frequent terms are displayed in a larger font size than less frequent terms, it is difficult to grasp the order among similarly frequent words.
A simple bar chart may not be as flashy as a word cloud, but it does a much better job of visualizing each term’s exact rank and frequency.
So, unless your management really wants to see a word cloud, I would recommend using a bar chart instead.
Conclusion
In this article, we looked at some fundamental EDA techniques for text data:
- Counts and Lengths: We looked at character, word, sentence, and string counts, as well as average word and sentence lengths.
- Term Frequency Analysis: We looked at the most frequent words and n-grams and discussed why you don’t need word clouds.
You can find the continuation of this article in “Intermediate EDA Techniques for NLP”.
This blog was originally published on Towards Data Science on Aug 31, 2022 and moved to this site on Feb 1, 2026.