This content originally appeared on Level Up Coding - Medium and was authored by Karthik Bhandary
Joins are one of the key concepts that a Data Scientist should know. When a Data Scientist is working, it mostly goes for collecting and organizing the data. The organization is done according to the requirements.
Joins helps in this organizing the data and working with them. In this blog, I am going to tell you how to perform joins on the data and you only need to know Pandas for you to understand. No SQL, only Pandas.
We are going to use the function .merge() of the pandas library. It has the following syntax.
Syntax of .merge()
df1.merge(df2, on=“the column on which you want to join”)
It will take more arguments which I will be talking about during the joins.
Coming to Joins, there are 4 types of joins. They are:
- Inner Join
- Left Join
- Right Join
- Outer Join
Inner Join
Inner join returns the rows that are in both tables. Meaning that the data is lost if they are not present in both tables.
If multiple columns have the same name then they're given suffixes. We can do that by giving the .merge() function the argument suffixes the name we want to give as a suffix.
For example, let's say I have the following anime.csv file with the following data.
I have read this data into a variable named anime.
anime = pd.read_csv(“anime.csv”)
We want to join this data with another file named anime_info.csv which has the data about the id, name, status of the anime, the type of anime(seasonal or long-running)
I am reading this data into the variable named anime_info.
anime_info = pd.read_csv(“anime_info.csv”)
Disclaimer: The data I am using is made up by me for better explanation. In real-time they are massive.
When we perform Inner join which rows do you think will remain? We will perform the join on the name column.
common_anime = anime.merge(anime_info, on=“name”)
The rows which are common in both the tables are the only ones that get selected in the inner join.
In this join, data is lost if it is not present in both the tables.
Left Join:
It returns all the rows of the left table and only the rows that match in the right table.
For us to perform the left join we are going to use the “ how” argument.
Example:
left = anime.merge(anime_info, on=“name”, how=“left”)
As you can see all the data from the left table (anime.csv) are included only the matching data are included in the right table (anim_info.csv). The missing data is represented as NaN(not a number)
With this most of the data can be kept, unlike the inner join where most of the data is lost.
Right Join:
This is the mirror copy of the left join. In this join, all the right table data is kept and only the matching data is kept in the left table. This is also performed by using the “how” argument of .merge().
Example:
right = anime.merge(anime_info, on=“name”, how=“right”)
As you can see all the data from the right table is kept and the data that is matching from the left is kept. Missing values are represented by NaN.
Outer Join:
In this join, every data is kept. No exception. In this join, no data loss is observed.
Example:
all_anime = anime.merge(anime_info, on=“name”, how=“outer”)
As you can see in the above image, all the data is merged and the missing values are filled in with NaN.
The default join that this .merge() method does is the “Inner join”.
Merging a table to itself
If you are wondering when will we merge a table to itself, don’t worry I will get to it in a bit. But if you are wondering if there is a new syntax for it, then don’t worry. It is similar to the ones you’ve seen until now.
Example:
anime.merge(anime, on=“id”)
The above is just to show you how it works.
When to merge a table to itself?
- When you have hierarchical relationships
- When you have sequential relationships
Conclusions
The above-mentioned joins are just the basic ones. There are other ways to join data.
The key takeaways are:
- What are joins
- How to use them
- Different types of joins
- When to use them
I hope you found this blog helpful. If you liked this blog then I suggest you follow me on Medium and YouTube, for more content on productivity, self-improvement, Coding, and Tech.
And while at it why don’t you check out my recent works:
Joins with Pandas Dataframes was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.
This content originally appeared on Level Up Coding - Medium and was authored by Karthik Bhandary
Karthik Bhandary | Sciencx (2021-08-10T04:30:20+00:00) Joins with Pandas Dataframes. Retrieved from https://www.scien.cx/2021/08/10/joins-with-pandas-dataframes/
Please log in to upload a file.
There are no updates yet.
Click the Upload button above to add an update.