99 Lessons on Data Analysis from Placing Top 5 in 5 Kaggle Analytics Challenges

(Grand)Masterclass: How to approach (and win) a Kaggle Analytics Competition

99 practical tips on data analysis, storytelling with data, and effective visualizations from placing top 5 in five Kaggle Analytics competitions.
Towards Data Science Archive

Published August 16, 2022

(Image by the author)

I have to agree with the critics: Kaggle Analytics challenges are only distantly related to writing real-world data analysis reports. But I like them because they can teach you a lot about the fundamentals of telling a story with data.

Kaggle Analytics challenges are only distantly related to writing real-world data analysis reports. But […] they can teach you a lot […].

This article was originally going to be a proper article with paragraphs and images. But the first draft already had over 7,000 words, so a listicle is what you get instead.

In this article, I share the “secret sauce” that has landed me in the top 5 [1] of five Kaggle Analytics challenges so far.

Don’t misinterpret this list as a collection of rules. I haven’t always followed these tips in my previous winning Kaggle Notebooks. Instead, this list is a collection of lessons I have learned along the way.

Throughout this article, I provide code snippets in Python with functions from the Pandas, Matplotlib, and Seaborn libraries. Feel free to use whatever programming language and visualization libraries you like.

import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns

Preparations

  1. Understand the difference between exploratory and explanatory data analysis.

“When we do exploratory analysis, it’s like hunting for pearls in oysters. We might have to open 100 oysters […] to find perhaps two pearls. When we’re at the point of communicating our analysis to our audience, we really want to be in the explanatory space, meaning you have a specific thing you want to explain, a specific story you want to tell – probably about those two pearls.” – [3]

  2. Read the problem statement – understand the problem statement.

  3. Get an overview of the whole dataset:

  • What is the file structure?
  • How many files does the whole dataset have?
  • What relationship do the files have?
  • What are common key columns?
  4. Get an overview of each file in the dataset with df.info():
  • How many columns are in a file?
  • How many rows are in a file?
  • What do the column names mean?
  • What type of data do you have (e.g., numerical, categorical, time series, text, etc.)?
  5. Read the dataset’s description.

  6. Check unique values with df.nunique() for plausibility and cardinality.
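A quick sketch of what df.nunique() reveals, on a toy DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy DataFrame with hypothetical columns, just for illustration
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],                 # unique per row: an identifier
    "country": ["DE", "DE", "US", "US", "FR"],  # low cardinality: categorical
    "rating":  [5, 5, 5, 5, 5],                 # constant: carries no information
})

# One unique count per column: identifiers, categoricals,
# and constant columns stand out immediately
print(df.nunique())
```

Columns with one unique value per row are usually identifiers; columns with a single unique value carry no information at all.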

  7. Get an overview of missing values.

# Display the percentage of missing values
df.isna().sum(axis = 0) / len(df) * 100

# Visualize the missing values
sns.heatmap(df.isna(),
            cbar = False,
            cmap = "binary")
  8. You don’t have to look at all the data if you have a large dataset – but don’t be lazy with your selection.
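One way to make a careful selection is a random, reproducible sample instead of just the first rows (toy data; the size is hypothetical):

```python
import pandas as pd

# Stand-in for a large dataset
df = pd.DataFrame({"value": range(1_000_000)})

# A random sample beats "just the first n rows", which may be sorted;
# a fixed random_state keeps the selection reproducible
sample = df.sample(n=10_000, random_state=42)
```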

  9. Be prepared that not every plot you create is going to make it into the final report. Make a lot of plots anyway.

  10. You only need three types of plots for univariate analysis: Histograms or boxplots for numerical features and bar charts (count plots) for categorical features.

# Explore numerical features
sns.histplot(data = df,
             x = num_feature)
sns.boxplot(data = df,
            y = num_feature)

# Explore categorical features
sns.countplot(data = df,
              x = cat_feature)
  11. Document what you are doing in the code – you’ll thank yourself later.

  12. Data cleaning and feature engineering should happen naturally during exploratory data analysis (EDA).

  13. Numerical features can be disguised as categorical features and vice versa.

  14. Be on the lookout for NaN values disguised as implausible values (e.g. -1, 0, or 999). Sometimes, they will show themselves as suspicious peaks in otherwise unsuspicious-looking distributions.

import numpy as np

# Replace invalid values with NaN
invalid_value = 9999
df["feature"] = np.where(df["feature"] == invalid_value,
                         np.nan,
                         df["feature"])
  15. Don’t ignore outliers. You just might find something interesting.

  16. Look at edge cases (top 5 and bottom 5).
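pandas has shortcuts for exactly this; a sketch on made-up scores:

```python
import pandas as pd

# Hypothetical scores, just for illustration
df = pd.DataFrame({"name": list("ABCDEFGH"),
                   "score": [3, 97, 15, 88, 42, 1, 99, 50]})

top_5 = df.nlargest(5, "score")      # the 5 highest scores
bottom_5 = df.nsmallest(5, "score")  # the 5 lowest scores
```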

  17. Create new features by either splitting a single feature into multiple features or combining multiple features into a new one.

# Splitting features
df["main_cat"] = df[feature].map(lambda x: x.split('_')[0])
df["sub_cat"] = df[feature].map(lambda x: x.split('_')[1])
df[["city", "country"]] = df["address"].str.split(', ', expand = True)

# Combining features
df["combined"] = df[["feature_1", "feature_2"]].agg('_'.join, axis = 1)
df["ratio"] = df["feature_1"] / df["feature_2"]
  18. Create count and length features from text features.

# Creating word count and character count features
df["char_count"] = df["text"].map(lambda x: len(x))
df["word_count"] = df["text"].map(lambda x: len(x.split(' ')))
  19. datetime64[ns] features contain a lot of new features:

# Convert to datetime data type
df["date"] = pd.to_datetime(df["date"],
                            format = '%Y-%m-%d')

# Creating datetimeindex features
df["year"] = pd.DatetimeIndex(df["date"]).year
df["month"] = pd.DatetimeIndex(df["date"]).month
df["day_of_year"] = pd.DatetimeIndex(df["date"]).dayofyear
df["weekday"] = pd.DatetimeIndex(df["date"]).weekday
# etc.
  20. For coordinates, the first number is always the latitude and the second is the longitude (but latitude corresponds to the y-axis and longitude to the x-axis when you plot the coordinates).

# Splitting coordinates into latitude and longitude
df["lat"] = df["coord"].map(lambda x: x.split(", ")[0])
df["lon"] = df["coord"].map(lambda x: x.split(", ")[1])
  21. Extend your dataset with additional data. It demonstrates your creativity. You have three options to get additional data (in order of descending effort):
  • Create your own dataset,
  • find a dataset and import it to Kaggle,
  • or use a dataset that is already available on Kaggle (I prefer this one).
  22. Don’t lose valuable data points when merging two DataFrames. Also, make sure that the spelling of the key column matches in both DataFrames. Double-check your work by checking the resulting DataFrame’s length with len(df).
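One way to guard a merge, sketched on toy DataFrames – the indicator parameter of pandas' merge flags rows that found no partner:

```python
import pandas as pd

# Two toy DataFrames sharing the key column "key" (hypothetical data)
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# indicator=True adds a _merge column that flags rows without a partner
merged = left.merge(right, on="key", how="left", indicator=True)

assert len(merged) == len(left)  # a left join must not drop rows
unmatched = merged[merged["_merge"] == "left_only"]  # key "c" found no partner
```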

  23. Review the evaluation criteria.

Before you start exploring data (Image by the author)

Exploratory Data Analysis

  24. Accept that you (probably) can’t look at all the relationships in the data. The number of possible combinations grows rapidly with the number of features n – there are already n + (n choose 2) + (n choose 3) ways to pick one, two, or three features to plot.
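For a feel of the numbers, here is that count for a hypothetical 20-feature dataset:

```python
from math import comb

n = 20  # hypothetical number of features

# Univariate, bivariate, and three-way views of the data
total = n + comb(n, 2) + comb(n, 3)
print(total)  # 20 + 190 + 1140 = 1350 possible plots
```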

  25. Start by plotting a few (random) relationships – just to get comfortable with the data.

  26. Don’t waste time on creating fancy data visualizations during the EDA (I promise we’ll get there later).

  27. Domain knowledge is king. Do some research to get familiar with the topic.

  28. Invest some time to think about what aspects you want to explore. Brainstorm a line of questioning that’s worth answering. If you don’t know where to start, start with the research questions suggested by the challenge host.

  29. Get an overview of possible relationships to look at before you begin with the multivariate analysis (but keep in mind that both of the following methods will only consider the numerical features).

# Display pair plot
sns.pairplot(df)

# Display correlation matrix
sns.heatmap(df.corr(),
            annot = True,
            fmt = ".1f",
            cmap = "coolwarm",
            vmin = -1,
            vmax = 1)
  30. Take notes of your findings in bullet point form after each plot.

  31. You only need four types of plots for bivariate analysis:

  • Scatterplots for the relationship between two numerical features
sns.scatterplot(data = df,   
                x = "num_feature_1",   
                y = "num_feature_2")
  • Boxplots for the relationship between a categorical and a numerical feature
sns.boxplot(data = df,   
            x = "cat_feature",   
            y = "num_feature")
  • Heatmaps or bar charts for the relationship between two categorical features
temp = pd.crosstab(index = df["cat_feature_1"],  
                   columns = df["cat_feature_2"])  
  
# Bar chart  
# Bar chart
temp.plot.bar()

# Heatmap
sns.heatmap(temp, annot = True)
  32. The groupby() method is your friend for multivariate analysis.

# How many feature_1 per feature_2?
df.groupby("feature_2")["feature_1"].nunique()

# What is the average feature_1 for each feature_2?
df.groupby("feature_2")["feature_1"].mean()
  33. Got time series data? Conduct a trend analysis using line plots.
sns.lineplot(data = df,   
             x = "time_feature",   
             y = "num_feature")
  34. You can conduct multivariate analysis without learning any new plots:
  • Scatterplot with hue or size for relationship between three numerical features
sns.scatterplot(data = df,
                x = "num_feature_1",
                y = "num_feature_2",
                hue = "num_feature_3")

sns.scatterplot(data = df,
                x = "num_feature_1",
                y = "num_feature_2",
                size = "num_feature_3")
  • Scatterplot with hue or style for relationship between two numerical features and a categorical feature
sns.scatterplot(data = df,
                x = "num_feature_1",
                y = "num_feature_2",
                style = "cat_feature")

sns.scatterplot(data = df,
                x = "num_feature_1",
                y = "num_feature_2",
                hue = "cat_feature")
  • Grouped bar charts or boxplots for relationship between two categorical features and a numerical feature
sns.barplot(data = df,
            x = "cat_feature_1",
            y = "num_feature",
            hue = "cat_feature_2")

sns.boxplot(data = df,
            x = "cat_feature_1",
            y = "num_feature",
            hue = "cat_feature_2")
  • Stacked grouped bar charts for relationship between three categorical features
  35. Always doubt your findings. Take some time to sanity-check and double-check your plots for data fallacies like Simpson’s paradox [2].

“A phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.” – [2]
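A minimal, fabricated illustration of the paradox – both groups trend upwards, yet the pooled data trends downwards:

```python
import pandas as pd

# Hypothetical data: within each group the trend is positive,
# but pooling the groups reverses it
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "x": [1, 2, 3, 7, 8, 9],
    "y": [10, 11, 12, 1, 2, 3],
})

corr_a = df[df["group"] == "A"]["x"].corr(df[df["group"] == "A"]["y"])  # positive
corr_b = df[df["group"] == "B"]["x"].corr(df[df["group"] == "B"]["y"])  # positive
corr_all = df["x"].corr(df["y"])  # negative!
```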

  36. Read, read, read. Extend your EDA with research.

  37. Modeling can be useful for data analysis. E.g., you can build a linear regression model to predict the value for the next year, you can apply clustering to create a new feature, or you can use feature importances to gain insights.
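As a minimal sketch of the first idea, a trend prediction with NumPy’s polyfit (a first-degree polynomial fit is a plain linear regression) on hypothetical yearly values:

```python
import numpy as np

# Hypothetical yearly metric, just for illustration
years = np.array([2018, 2019, 2020, 2021])
values = np.array([100.0, 110.0, 120.0, 130.0])

# Fitting a first-degree polynomial is a plain linear regression
slope, intercept = np.polyfit(years, values, deg=1)
prediction = slope * 2022 + intercept  # extrapolate one year ahead
```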

  38. Take a minute to make sure you are interpreting the plots correctly.

  39. Make sure you have a sufficient number of plots (or insights) to build a story around.

  40. Refactor your code. It helps you detect errors and makes your code more accessible and reproducible for your audience.

  41. It’s OK if the variety of data visualizations is underwhelming at this stage.

Connecting the dots – The insights after EDA (Image by the author)

Explanatory Data Analysis

  42. Storytelling is more important than data visualizations – trust me, I won my first prize with mostly bar charts.

  43. Pick a clear topic and build a story around it. Make sure the question you are answering is useful to the competition host.

  44. The entry for a Kaggle Analytics challenge is not a collection of plots. It needs an introduction, a body, and a conclusion.

  45. Create an outline based on your findings.

  46. Don’t hesitate to discard most of your plots so far.

  47. Tell your audience about the dataset. What data are you working with? How many data points are there? Did you add external data?

  48. Explain what you did.

  49. Show what you didn’t see. Did you have an interesting hypothesis that the data didn’t support? Show that and discuss it.

  50. Don’t include a point just because you think the data visualization is cool. If the finding is not relevant to the overall story you are telling, cut it.

  51. Write a first draft.

  52. Now is the time for the fancy plots (see, I kept my promise).

  53. You’ll make better data visualizations if you know what you want to show.

  54. Double-check that the metric you are using is suitable for what you want to show. E.g., to measure a platform’s popularity, would you use the total number of accounts or the average number of daily active users?

  55. Avoid vanity metrics.

  56. Get some inspiration from the pros – but understand that boring data visualizations (aka bar charts) are usually the most effective.

  57. Decide which data visualization to use based on what (distribution, relationship, composition, comparison) you want to show.

  58. Your best bets are these six types of plots and their variations: bar charts, heatmaps, histograms, scatterplots, boxplots, and line plots.

  59. Remember that single numbers and tables can be data visualizations, too.

  60. Please don’t use pie charts. Also, please don’t use donut charts. If your plot is named after a dessert, don’t use it (and when you do, know that you shouldn’t).

  61. Replace word clouds with bar charts.
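The word frequencies behind a word cloud drop straight into a bar chart; a sketch on a toy corpus:

```python
import pandas as pd

# Toy corpus, just for illustration
texts = pd.Series(["data is fun", "data is everywhere", "fun with data"])

# Split into words, flatten, and count: the result feeds straight
# into a horizontal bar chart instead of a word cloud
word_counts = texts.str.split().explode().value_counts()
word_counts.head(10).plot.barh()
```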

  62. Please, please, please don’t use 3D effects.

  63. Use choropleth maps intentionally (and not just because you have geographical data).

  64. Define a color palette. Have at least one highlight color and one contrast color.

from matplotlib.colors import LinearSegmentedColormap

# Set color theme
highlight_color = "#1ebeff"
contrast_color = "#fae042"

custom_palette = LinearSegmentedColormap.from_list("custom_palette",
                                                   [highlight_color,
                                                    "#ffffff",
                                                    contrast_color])
plt.cm.register_cmap("custom_palette", custom_palette)
  65. Make sure to use the right color palette for your purpose:
  • Sequential for ordered values (e.g. 1, 2, 3, …)
  • Diverging for opposing values with a neutral mid-value (e.g. -1, 0, 1)
  • Qualitative for categorical values
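Seaborn ships ready-made palettes for all three cases; the palette names below are just common examples:

```python
import seaborn as sns

# One palette per use case (these names are common choices, not the only ones)
sequential  = sns.color_palette("Blues", 5)     # ordered values
diverging   = sns.color_palette("coolwarm", 5)  # opposing values around a midpoint
qualitative = sns.color_palette("tab10", 5)     # unordered categories
```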
  66. Grey is your friend when you need to remove the focus from context information.

  67. Make sure your colors are colorblind- and photocopy-safe.

  68. Visualize like an adult: Add a title and labels to your plot.
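In Matplotlib terms, on toy data:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["A", "B", "C"], [3, 7, 5])  # hypothetical counts

# Title and axis labels make the plot readable without context
ax.set_title("Counts per category")
ax.set_xlabel("Category")
ax.set_ylabel("Count")
```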

  69. Add a legend if suitable.

  70. Set an appropriate font size.

plt.rcParams.update({"font.size": 14})
  71. Keep it simple. Remove any distractions and redundancies from the plot.

  72. Don’t start the quantitative axis of your bar charts anywhere other than 0.

  73. Don’t compare two plots with different axis ranges.
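With Matplotlib subplots, sharey=True is one way to keep two panels comparable (toy data):

```python
import matplotlib.pyplot as plt

# sharey=True forces both panels onto the same y-axis range,
# so the bars stay visually comparable
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.bar(["A", "B"], [10, 90])
ax2.bar(["A", "B"], [20, 30])
```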

  74. Don’t mislead your audience with your data visualizations, e.g. by ignoring tips 72 and 73.

  75. Add annotations directly to your plot.

ax.annotate("Annotation text",   
            xy = (x_pos, y_pos))
  76. If you are working with ordinal categorical data, make sure to also order the bars in the bar charts to represent the ordinal feature correctly.
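With Seaborn, the order parameter does exactly this; a sketch on a hypothetical ordinal feature:

```python
import pandas as pd
import seaborn as sns

# Hypothetical ordinal feature, just for illustration
df = pd.DataFrame({"size": ["M", "S", "L", "M", "S", "XL"]})

# Without order= the bars appear in order of first occurrence;
# pass the ordinal order explicitly instead
ax = sns.countplot(data=df, x="size", order=["S", "M", "L", "XL"])
```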

  77. Exploit preattentive processing.

  78. Think “mobile first” (because a good portion of your audience is going to read your report on their phone).

  79. Look at each data visualization. Does it convey the message without context? If not, revise.

Effectively communicating your insights – The difference between exploratory and explanatory data analysis (Image by the author)

Finishing Touches

  80. Write a second draft.

  81. A data visualization should always be accompanied by some text. Turn the bullet points into text. Plots alone won’t cut it.

  82. Revise and edit your second draft.

  83. Use an attention-grabbing image at the beginning of your report. (You can find great photographs on Unsplash but make sure to credit your source.)

  84. Nobody wants to see your code – hide it.

  85. Hide console warnings as well.

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
  86. Review your cell outputs. If it is not a data visualization, then it must help tell your story. Otherwise, hide it as well.

  87. Lead with the insights – nobody is going to read every word of your report. Add a summary in bullet point form at the beginning.

  88. Use bolding to highlight important points in your text (because nobody is going to read every word of your report).

  89. Make sure to write a good conclusion. Did I mention that nobody is going to read every word of your report?

  90. Use a spellchecker. I like Grammarly.

  91. Make sure you check all the boxes for the challenge’s evaluation criteria.

  92. Phew, emojis. Love ’em or hate ’em – just promise me not to overdo them, alright?

  93. Keep in mind that your audience might not be data scientists. Is your analysis accessible?

  94. Cite your sources.

  95. Invest some time to come up with a good title.

  96. Have a friend review your report and/or read it out loud.

  97. Let the report rest for a few days.

  98. Give your report a final review.

  99. Let go of perfectionism and submit.

References

[1] Below I have listed my portfolio of prize-winning Kaggle Notebooks for your reference:

[2] geckoboard, “Data fallacies”. geckoboard.com. https://www.geckoboard.com/best-practice/statistical-fallacies/ (accessed August 14, 2022)

[3] Nussbaumer Knaflic, C. (2015). Storytelling with Data. John Wiley & Sons.


This blog was originally published on Towards Data Science on Aug 16, 2022 and moved to this site on Feb 1, 2026.
