Pandas: Merge

Pandas provides various methods for combining dataframes. Most important are:

In many situations, data comes from several sources, and we need to combine this data to perform analyses.

For instance, we have a dataset with employees’ id, names and adresses; and another dataset with employees’ salary.

In this notebook, we will see how to use pandas merge method.

The basics types of join are:

  • Left Join
  • Right Join
  • Outer Join
  • Inner Join
import pandas as pd
# Our employees' id and names
employees = pd.DataFrame({'id':[1,2,3,4],'name':['Johana','Mike','Patricia','James']})
employees

id name
0 1 Johana
1 2 Mike
2 3 Patricia
3 4 James
# Our employees' id and salary
salary = pd.DataFrame({'id':[1,2,3,5],'salary':[120000,100000,130000,90000]})
salary

id salary
0 1 120000
1 2 100000
2 3 130000
3 5 90000

Inner Join

The inner join is the most common type of join and the default for the pandas.DataFrame.merge method

employees_with_salary = employees.merge(salary,
                                        left_on="id",
                                        right_on="id",
                                        how='inner')  # default
employees_with_salary

id name salary
0 1 Johana 120000
1 2 Mike 100000
2 3 Patricia 130000

The type of join is specified with the parameter how. When performing an inner join, as seen above, we don’t need to specify this parameter as it’s the default.

inner uses the intersection of both dataframes. In our case, we can see that:

  • James has id 4, but id 4 is not in the salary dataframe.
  • From the salary dataframe, there is an employee with id 5, but there is no id 5 in the employees dataframe

The resulting dataset from an inner join doesn’t has James nor employee with id 5. This introduces us to the next type of join: the outer join.

pandas inner join

Outer Join

The outer join uses the union from both dataframes. Remeber that the parameter for specifying the type of join is how.

employees_with_salary_outer = employees.merge(salary, left_on='id', right_on='id', how='outer')
employees_with_salary_outer

id name salary
0 1 Johana 120000.0
1 2 Mike 100000.0
2 3 Patricia 130000.0
3 4 James NaN
4 5 NaN 90000.0

We see that the employee wit id 4 is included. Since this name is not in the employees dataframe, a NaN value is instead of his name.

pandas outer join

Left Join

The left join uses the keys from the left frame. In our case, James will still be on the merged dataframe:

employees_with_salary_left = employees.merge(salary, left_on='id', right_on='id', how='left')
employees_with_salary_left

id name salary
0 1 Johana 120000.0
1 2 Mike 100000.0
2 3 Patricia 130000.0
3 4 James NaN

Since our right dataframe has an employee with id 5 but there is no employee with id 5 in the left dataframe. This employee 5 will be left out from the left join resulting dataframe:

pandas left join

Right Join

And if we use the how='right' parameter, James will be left out (id 4 is not present on the salary dataframe); but the employee whose id is 5 will be in the resulting dataframe.

employees_with_salary_right = employees.merge(salary, left_on='id', right_on='id', how='right')
employees_with_salary_right

id name salary
0 1 Johana 120000
1 2 Mike 100000
2 3 Patricia 130000
3 5 NaN 90000

The left dataframe has an employee with id 4, but there is no employee with id 4 in the right dataframe. This employee will be left out from the resulting right join dataframe:

pandas right join

Previous
Next