Intermediate Data Analysis Techniques for Text Data

How to perform Exploratory Data Analysis on text data for Natural Language Processing

Intermediate EDA techniques for NLP: part-of-speech tagging to find frequent nouns, verbs, and adjectives, and sentiment analysis using Python and NLTK.
Towards Data Science Archive

Published September 13, 2022

Image by the author.

Exploratory Data Analysis (EDA) for text data is more than counting characters and terms. To take your EDA to the next level, you can look at each word and categorize it or you can analyze the overall sentiment of a text.

In this article, we will look at some intermediate EDA techniques for text data:

  1. Part-of-Speech Tagging: We’ll look at Part-of-Speech (POS) tagging and how to use it to get the most frequent adjectives, nouns, verbs, etc.
  2. Sentiment Analysis: We’ll look at sentiment analysis and explore whether the dataset has a positive or negative tendency.

This article is a continuation of my previous story on “Fundamental EDA Techniques for NLP”. You can read it here:

As in the previous article, we will use the Women’s E-Commerce Clothing Reviews Dataset from Kaggle.

To simplify the examples, we will use 450 positive reviews (rating == 5) and 450 negative reviews (rating == 1). This reduces the number of data points to 900 rows, reduces the number of rating classes to two, and balances the positive and negative reviews.

Additionally, we will only use two columns: the review text and the rating.
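The subsetting and balancing steps above can be sketched as follows. This is a hypothetical illustration with a tiny fake DataFrame standing in for the Kaggle CSV, scaling the 450 reviews per class down to 2:

```python
import pandas as pd

# Tiny stand-in for the Kaggle reviews data (made-up rows)
df = pd.DataFrame({
    "text":   ["love it", "perfect dress", "great fit",
               "too small", "poor quality", "runs large"],
    "rating": [5, 5, 5, 1, 1, 1],
})

n = 2  # the article uses 450 reviews per class

# Sample the same number of positive and negative reviews
df = pd.concat([
    df[df["rating"] == 5].sample(n, random_state=42),
    df[df["rating"] == 1].sample(n, random_state=42),
]).reset_index(drop=True)
```

The `random_state` makes the sampling reproducible; with the real dataset, replace the fake DataFrame with the loaded CSV and set `n = 450`.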

The DataFrame’s head of the reduced dataset looks like this:

Head of Simplified Women’s E-Commerce Clothing Reviews Dataset (Image by the author)

Part-of-Speech Tagging

In the fundamental EDA techniques, we covered the most frequent words and bi-grams and noticed that adjectives like “great” and “perfect” were among the most frequent words in the positive reviews.

With POS tagging, you can refine the EDA of the most frequent terms, e.g., by exploring which adjectives or verbs are most common.

POS tagging takes every token in a text and categorizes it as a noun, verb, adjective, and so on, as shown below:

Part-of-Speech (POS) tagged sentence

If you are curious about how I visualized this sentence, you can check out my tutorial here:

To check which POS tags are the most common, we will start by creating a corpus of all review texts in the DataFrame:

corpus = df["text"].values.tolist()

Corpus as List of review texts (Image by the author)

Next, we’ll tokenize the entire corpus as preparation for POS tagging.

from nltk import word_tokenize  
tokens = word_tokenize(" ".join(corpus))

List of tokens (Image by the author)

Then, we’ll POS tag each token in the corpus with the coarse tag set “universal”:

import nltk  
tags = nltk.pos_tag(tokens,   
                    tagset = "universal")

POS tagged tokens (Image by the author)
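Before zooming in on one word class, you can also count how the tags themselves are distributed. A minimal sketch, using a short toy list in place of the full `tags` from above:

```python
from collections import Counter

# Toy tagged tokens standing in for the article's full `tags` list
tags = [("love", "VERB"), ("the", "DET"), ("soft", "ADJ"),
        ("blue", "ADJ"), ("dress", "NOUN"), ("fits", "VERB")]

# Count only the POS category of each (word, pos) pair
tag_counts = Counter(pos for _, pos in tags)
```

`tag_counts.most_common()` then shows which word classes dominate the corpus.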

As in the Term Frequency analysis of the previous article, we will create a list of words, removing all stopwords. Additionally, we will only include words with a specific tag, e.g., adjectives.

Then all we have to do is to use the Counter class as in the previous article.

from collections import Counter
from nltk.corpus import stopwords
import seaborn as sns

tag = "ADJ"
stop = set(stopwords.words("english"))

# Get all tokens that are tagged as adjectives and are not stopwords
adjectives = [word for word, pos in tags if (pos == tag) and (word not in stop)]

# Count the most common adjectives
most_common = Counter(adjectives).most_common(10)

# Visualize the most common adjectives as a bar plot
words, frequency = [], []
for word, count in most_common:
    words.append(word)
    frequency.append(count)

sns.barplot(x=frequency, y=words)

Below, you can see the top 10 most common adjectives for the negative and positive reviews:

Most frequent adjectives separated by “Rating” class (Image by the author).

From this technique, we can see that words like “small”, “fit”, “big”, and “large” are among the most common. This might indicate that customers are more disappointed about a piece of clothing’s fit than, e.g., about its quality.

Sentiment Analysis

The main idea of sentiment analysis is to get an understanding of whether a text has a positive or negative tone. E.g., the sentence “I love this top.” has a positive sentiment, and the sentence “I hate the color.” has a negative sentiment.

You can use TextBlob for simple sentiment analysis as shown below:

from textblob import TextBlob

blob = TextBlob("I love the cut")
blob.polarity

Polarity is an indicator of whether a statement is positive or negative and is a number between -1 (negative) and 1 (positive). The sentence “I love the cut” has a polarity of 0.5, while the sentence “I hate the color” has a polarity of -0.8.

The combined sentence “I love the cut but I hate the color” has a polarity of -0.15.

For multiple sentences in a text, you can get the polarity of each sentence as shown below:

text = "I love the cut. I get a lot of compliments. I love it."  
[sentence.polarity for sentence in TextBlob(text).sentences]

This code returns an array of polarities of [0.5, 0.0, 0.5]. That means that the first and last sentences have a positive sentiment while the second sentence has a neutral sentiment.

If we apply this sentiment analysis to the whole DataFrame like this,

import numpy as np

df["polarity"] = df["text"].map(
    lambda x: np.mean([sentence.polarity for sentence in TextBlob(x).sentences])
)

we can plot a boxplot comparison with the following code:

sns.boxplot(data = df,   
            y = "polarity",   
            x = "rating")

Below, you can see the polarity boxplots for the negative and positive reviews:

Polarity separated by “Rating” class (Image by the author).

As you would expect, we can see that negative reviews (rating == 1) have an overall lower polarity than positive reviews (rating == 5).
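To back the boxplot with numbers, you can compare per-class summary statistics of the polarity. A minimal sketch with toy values standing in for the DataFrame built above:

```python
import pandas as pd

# Toy stand-in for the article's DataFrame with a computed polarity column
df = pd.DataFrame({
    "rating":   [1, 1, 1, 5, 5, 5],
    "polarity": [-0.4, -0.1, 0.05, 0.3, 0.5, 0.45],
})

# Mean and median polarity per rating class
summary = df.groupby("rating")["polarity"].agg(["mean", "median"])
```

On the real data, a clearly higher mean and median for `rating == 5` confirms what the boxplot shows visually.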

Conclusion

In this article, we looked at some intermediate EDA techniques for text data:

  1. Part-of-Speech Tagging: We looked at Part-of-Speech tagging and how to use it to get the most frequent adjectives as an example.
  2. Sentiment Analysis: We looked at sentiment analysis and explored the review texts’ polarities.


This blog was originally published on Towards Data Science on Sep 13, 2022 and moved to this site on Feb 1, 2026.
