Pandas: Merge

Pandas provides various methods for combining dataframes. Most important are:

merge
concat

In many situations, data comes from several sources, and we need to combine this data to perform analyses.

For instance, we have a dataset with employees’ id, names and adresses; and another dataset with employees’ salary.

In this notebook, we will see how to use pandas merge method.

The basics types of join are:

Left Join
Right Join
Outer Join
Inner Join

import pandas as pd

# Our employees' id and names
employees = pd.DataFrame({'id':[1,2,3,4],'name':['Johana','Mike','Patricia','James']})
employees

	id	name
0	1	Johana
1	2	Mike
2	3	Patricia
3	4	James

# Our employees' id and salary
salary = pd.DataFrame({'id':[1,2,3,5],'salary':[120000,100000,130000,90000]})
salary

	id	salary
0	1	120000
1	2	100000
2	3	130000
3	5	90000

Inner Join

The inner join is the most common type of join and the default for the pandas.DataFrame.merge method

employees_with_salary = employees.merge(salary,
                                        left_on="id",
                                        right_on="id",
                                        how='inner')  # default
employees_with_salary

	id	name	salary
0	1	Johana	120000
1	2	Mike	100000
2	3	Patricia	130000

The type of join is specified with the parameter how. When performing an inner join, as seen above, we don’t need to specify this parameter as it’s the default.

inner uses the intersection of both dataframes. In our case, we can see that:

James has id 4, but id 4 is not in the salary dataframe.
From the salary dataframe, there is an employee with id 5, but there is no id 5 in the employees dataframe

The resulting dataset from an inner join doesn’t has James nor employee with id 5. This introduces us to the next type of join: the outer join.

pandas inner join

Outer Join

The outer join uses the union from both dataframes. Remeber that the parameter for specifying the type of join is how.

employees_with_salary_outer = employees.merge(salary, left_on='id', right_on='id', how='outer')
employees_with_salary_outer

	id	name	salary
0	1	Johana	120000.0
1	2	Mike	100000.0
2	3	Patricia	130000.0
3	4	James	NaN
4	5	NaN	90000.0

We see that the employee wit id 4 is included. Since this name is not in the employees dataframe, a NaN value is instead of his name.

pandas outer join

Left Join

The left join uses the keys from the left frame. In our case, James will still be on the merged dataframe:

employees_with_salary_left = employees.merge(salary, left_on='id', right_on='id', how='left')
employees_with_salary_left

	id	name	salary
0	1	Johana	120000.0
1	2	Mike	100000.0
2	3	Patricia	130000.0
3	4	James	NaN

Since our right dataframe has an employee with id 5 but there is no employee with id 5 in the left dataframe. This employee 5 will be left out from the left join resulting dataframe:

pandas left join

Right Join

And if we use the how='right' parameter, James will be left out (id 4 is not present on the salary dataframe); but the employee whose id is 5 will be in the resulting dataframe.

employees_with_salary_right = employees.merge(salary, left_on='id', right_on='id', how='right')
employees_with_salary_right

	id	name	salary
0	1	Johana	120000
1	2	Mike	100000
2	3	Patricia	130000
3	5	NaN	90000

The left dataframe has an employee with id 4, but there is no employee with id 4 in the right dataframe. This employee will be left out from the resulting right join dataframe:

pandas right join

Last updated on Mar 15, 2020

Edit this page