How to Merge Pandas DataFrames

One common pitfall when merging two DataFrames is unintentionally losing valuable data points. Sometimes you need to extend your initial dataset with additional information from a second dataset. For this, you can read the two datasets into pandas DataFrames and then combine them with the .merge() method into one DataFrame. However, depending on how you merge them, you can end up with fewer or more data points as expected.

However, depending on how you merge them, you can end up with fewer or more data points as expected.

This article will go over the four most common methods to merge two DataFrames. Since merging pandas DataFrames is similar to SQL joins, we will use them as analogies [1]. Namely, we will showcase how to conduct:

LEFT OUTER JOIN (pandas: “left”)
RIGHT OUTER JOIN (pandas: “right”)
FULL OUTER JOIN (pandas: “outer”)
INNER JOIN (pandas: “inner”)

Also, we will show you how you can verify your results.

Fundamentals

To explain the concepts we will use the following two minimal fictional datasets. In this example, we have two tables for pandas in zoos. The first table contains the location information about zoos. The second table contains information about which panda is in which zoo.

Left DataFrame df1 (blue) and right DataFrame df2 (yellow) (Image by the author)

The DataFrames are colored to illustrate which entries result from which DataFrame in the following examples. When merging two DataFrames, you refer to them as “left” and “right” DataFrame. In this example, df1 is the left DataFrame and is colored in blue. df2 is the right DataFrame and is colored in yellow. If an entry in the merged DataFrame results from both DataFrames, it will be indicated with a green row background.

Left DataFrame: df1, colored in blue
Right DataFrame: df2, colored in yellow
Key column: Common column to merge df1 and df2 on. In this example, the key column is “zoo_id”.
Merged DataFrame: df_merged with rows from left in blue, from right in yellow, and from both in green

Let’s have a look at the .merge() method and its essential parameters. This method has more parameters than those discussed below. However, we will only touch on those that are relevant to this article. You can refer to the documentation [2] for further parameters.

DataFrame.merge(right,   
                how = "...",   
                on = None,   
                indicator = False,  
                ...)

First, you call the .merge() method from the left Dataframe df1 and the first parameter is the right DataFrame df2.

df_merged = df1.merge(df2)

You could also merge two DataFrames as follows, where the first argument is the left DataFrame and the second argument is the right DataFrame:

df_merged = pd.merge(df1, df2)

While the .merge() method is smart enough to find the common key column to merge on, I would recommend to explicitly define it with the parameter on. Not only does it make your code more readable, but this also speeds up the execution time.

df_merged = df1.merge(df2,  
                      on = "zoo_id")

If the key columns don’t have an identical name in both DataFrames, you can use the parameters on_left and on_right instead of on.

df_merged = df1.merge(df2,  
                      on_left = "key1",  
                      on_right = "key2")

To indicate from which DataFrame a row in the merged DataFrame resulted from, we will use the parameter indicator = True. This option will create a new column “_merge” in the merged DataFrame as you will see in the following examples. For the regular usage of merging DataFrames, you can omit the indicator parameter.

FULL OUTER JOIN (pandas: “outer”)

What if you would want to get the big picture of every panda and every zoo?

For this, you would use a FULL OUTER JOIN in SQL speak [1].

SELECT *  
FROM df1  
FULL OUTER JOIN df2  
  ON df1.zoo_id = df2.zoo_id;

In pandas, you would use how = "outer" [2].

df_merged = df1.merge(df2,   
                      on = "zoo_id",   
                      how = "outer",  
                      indicator = True)

Below you can see every possibility to match each row from both DataFrames via the key column. The values 101, 102, and 103 appear in both key columns of both DataFrames. A match with both DataFrames is indicated with a green dot at the intersection of the two DataFrames.

However, the value 104 only appears in the key column of the left DataFrame and the value 105 only appears in the key column of the right DataFrame. The unmatched rows are indicated with a blue or yellow dot respectively at the intersection with a line called “no match”.

A full outer join contains all the dots in the below figure.

What if you would want to get the big picture of every panda and every zoo?

DataFrame merged with "outer" (Image by the author) — DataFrame merged with “outer” (Image by the author)

As a sanity check, the expected length of the merged DataFrame should be longer than or equal to the length of the longer DataFrame. The merged DataFrame df_merged has a total of seven rows: four from both, one from left only, and two from right only as indicated in the column _merge.

While the green rows contain no NULL values, the blue and yellow rows have missing values. Since the green rows result from both DataFrames, each column has a value. However, since the left DataFrame df2 did not contain any pandas living in the zoo with the zoo_id = 104, the column panda_name is nan for row 4. The same goes for the yellow rows 5 and 6 since df1 did not contain any information about the zoo with the zoo_id = 105.

INNER JOIN (pandas: “inner”)

But what if you would want to look at only zoos which house pandas?

For this, you would use an INNER JOIN in SQL speak [1].

SELECT *  
FROM df1  
INNER JOIN df2  
  ON df1.zoo_id = df2.zoo_id;

In pandas, you would use how = "inner" [2].

df_merged = df1.merge(df2,   
                      on = "zoo_id",   
                      how = "inner",  
                      indicator = True)

In the below figure, you can again see the matches as described for the FULL OUTER JOIN. However, an INNER JOIN only considers the green dots, which indicate that a value is present in both key columns of both DataFrames. The unmatched values (blue and yellow dots from FULL OUTER JOIN) are excluded as illustrated in the above panda Venn diagram.

But what if you would want to look at only zoos which house pandas?

DataFrame merged with "inner" (Image by the author) — DataFrame merged with “inner” (Image by the author)

As a sanity check, the expected length of the merged DataFrame should be longer than or equal to the length of the shorter DataFrame. The merged DataFrame df_merged has a total of four rows: four from both as indicated in the column _merge.

LEFT OUTER JOIN (pandas: “left”)

Now, let’s say you would want to know for every zoo which pandas it houses. E.g., with this information, you could calculate, how many pandas each zoo has.

For this, you would use a LEFT OUTER JOIN in SQL speak [1].

SELECT *  
FROM df1  
LEFT OUTER JOIN df2  
  ON df1.zoo_id = df2.zoo_id;

In pandas, you would use how = "left" [2].

df_merged = df1.merge(df2,   
                      on = "zoo_id",   
                      how = "left",  
                      indicator = True)

In the below figure, you can again see the matches as described for the FULL OUTER JOIN. However, a LEFT OUTER JOIN only considers the green and blue dots as illustrated in the above panda Venn diagram. The unmatched values from the right DataFrame (yellow dots from FULL OUTER JOIN) are excluded.

Let’s say you would want to calculate how many pandas each zoo has.

DataFrame merged with "left" (Image by the author) — DataFrame merged with “left” (Image by the author)

As a sanity check, the expected length of the merged DataFrame should be longer than or equal to the length of the left DataFrame. The merged DataFrame df_merged has a total of five rows: four from both and one from left only as indicated in the column _merge.

RIGHT OUTER JOIN (pandas: “right”)

Finally, let’s say you would want to know for every panda in which zoo it lives.

For this, you would use a RIGHT OUTER JOIN in SQL speak [1].

SELECT *  
FROM df1  
RIGHT OUTER JOIN df2  
  ON df1.zoo_id = df2.zoo_id;

In pandas, you would use how = "right" [2].

df_merged = df1.merge(df2,   
                      on = "zoo_id",   
                      how = "right",  
                      indicator = True)

In the below figure, you can again see the matches as described for the FULL OUTER JOIN. However, a RIGHT OUTER JOIN only considers the green and yellow dots as illustrated in the above panda Venn diagram. The unmatched values from the left DataFrame (blue dots from FULL OUTER JOIN) are excluded.

Let’s say you would want to know for every panda in which zoo it lives.

DataFrame merged with "right" (Image by the author) — DataFrame merged with “right” (Image by the author)

As a sanity check, the expected length of the merged DataFrame should be longer than or equal to the length of the right DataFrame. The merged DataFrame df_merged has a total of six rows: four from both and two from right only as indicated in the column _merge.

Conclusion

This article (literally) illustrated how to merge two pandas DataFrames with the .merge() method. Namely, we looked at how to conduct the most common types of SQL joins in pandas: FULL OUTER JOIN, INNER JOIN, LEFT OUTER JOIN, and RIGHT OUTER JOIN.

Below you can find a visual summary of this article as a cheat sheet:

Cheat sheet for how to merge pandas DataFrames (Image by the author)

References

[1] “pandas 1.4.2 documentation”, “Comparison with SQL.” pandas.pydata.org. https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#join (accessed July 13, 2022)

[2] “pandas 1.4.2 documentation”, “pandas.DataFrame.merge.” pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html (accessed July 13, 2022)

This blog was originally published on Towards Data Science on Jul 20, 2022 and moved to this site on Feb 1, 2026.