Joins with Pandas Dataframes

Joins are one of the key concepts that a Data Scientist should know. When a Data Scientist is working, it mostly goes for collecting and organizing the data. The organization is done according to the requirements.

Photo by Roman Kraft on Unsplash

Joins helps in this organizing the data and working with them. In this blog, I am going to tell you how to perform joins on the data and you only need to know Pandas for you to understand. No SQL, only Pandas.

We are going to use the function .merge() of the pandas library. It has the following syntax.

Syntax of .merge()

df1.merge(df2, on=“the column on which you want to join”)

It will take more arguments which I will be talking about during the joins.

Coming to Joins, there are 4 types of joins. They are:

Inner Join
Left Join
Right Join
Outer Join

Inner Join

Inner join returns the rows that are in both tables. Meaning that the data is lost if they are not present in both tables.

If multiple columns have the same name then they’re given suffixes. We can do that by giving the .merge() function the argument suffixes the name we want to give as a suffix.

For example, let’s say I have the following anime.csv file with the following data.

Image by author

I have read this data into a variable named anime.

anime = pd.read_csv(“anime.csv”)

We want to join this data with another file named anime_info.csv which has the data about the id, name, status of the anime, the type of anime(seasonal or long-running)

Image by Author

I am reading this data into the variable named anime_info.

anime_info = pd.read_csv(“anime_info.csv”)

Disclaimer: The data I am using is made up by me for better explanation. In real-time they are massive.

When we perform Inner join which rows do you think will remain? We will perform the join on the name column.

common_anime = anime.merge(anime_info, on=“name”)

The rows which are common in both the tables are the only ones that get selected in the inner join.

Image by author

In this join, data is lost if it is not present in both the tables.

Left Join:

It returns all the rows of the left table and only the rows that match in the right table.

For us to perform the left join we are going to use the “ how” argument.

Example:

left = anime.merge(anime_info, on=“name”, how=“left”)

Image by Author

As you can see all the data from the left table (anime.csv) are included only the matching data are included in the right table (anim_info.csv). The missing data is represented as NaN(not a number)

With this most of the data can be kept, unlike the inner join where most of the data is lost.

Right Join:

Image by Author

This is the mirror copy of the left join. In this join, all the right table data is kept and only the matching data is kept in the left table. This is also performed by using the “how” argument of .merge().

Example:

right = anime.merge(anime_info, on=“name”, how=“right”)

Image by Author

As you can see all the data from the right table is kept and the data that is matching from the left is kept. Missing values are represented by NaN.

Outer Join:

Image by Author

In this join, every data is kept. No exception. In this join, no data loss is observed.

Example:

all_anime = anime.merge(anime_info, on=“name”, how=“outer”)

Image by Author

As you can see in the above image, all the data is merged and the missing values are filled in with NaN.

The default join that this .merge() method does is the “Inner join”.

Merging a table to itself

If you are wondering when will we merge a table to itself, don’t worry I will get to it in a bit. But if you are wondering if there is a new syntax for it, then don’t worry. It is similar to the ones you’ve seen until now.

Example:

anime.merge(anime, on=“id”)

The above is just to show you how it works.

When to merge a table to itself?

When you have hierarchical relationships
When you have sequential relationships

Conclusions

The above-mentioned joins are just the basic ones. There are other ways to join data.

The key takeaways are:

What are joins
How to use them
Different types of joins
When to use them

I hope you found this blog helpful. If you liked this blog then I suggest you follow me on Medium and YouTube, for more content on productivity, self-improvement, Coding, and Tech.

And while at it why don’t you check out my recent works:

Joins with Pandas Dataframes was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

Post date August 10, 2021
Post categories In data-science, dataframes, join, pandas

This content originally appeared on Level Up Coding - Medium and was authored by Karthik Bhandary

Photo by Roman Kraft on Unsplash

We are going to use the function .merge() of the pandas library. It has the following syntax.

Syntax of .merge()

df1.merge(df2, on=“the column on which you want to join”)

It will take more arguments which I will be talking about during the joins.

Coming to Joins, there are 4 types of joins. They are:

Inner Join
Left Join
Right Join
Outer Join