Pandas: Merge
Pandas provides various methods for combining dataframes. Most important are:
-
merge
concat
In many situations, data comes from several sources, and we need to combine this data to perform analyses.
For instance, we have a dataset with employees’ id, names and adresses; and another dataset with employees’ salary.
In this notebook, we will see how to use pandas merge
method.
The basics types of join are:
- Left Join
- Right Join
- Outer Join
- Inner Join
import pandas as pd
# Our employees' id and names
employees = pd.DataFrame({'id':[1,2,3,4],'name':['Johana','Mike','Patricia','James']})
employees
id | name | |
---|---|---|
0 | 1 | Johana |
1 | 2 | Mike |
2 | 3 | Patricia |
3 | 4 | James |
# Our employees' id and salary
salary = pd.DataFrame({'id':[1,2,3,5],'salary':[120000,100000,130000,90000]})
salary
id | salary | |
---|---|---|
0 | 1 | 120000 |
1 | 2 | 100000 |
2 | 3 | 130000 |
3 | 5 | 90000 |
Inner Join
The inner join is the most common type of join and the default for the
pandas.DataFrame.merge method
employees_with_salary = employees.merge(salary,
left_on="id",
right_on="id",
how='inner') # default
employees_with_salary
id | name | salary | |
---|---|---|---|
0 | 1 | Johana | 120000 |
1 | 2 | Mike | 100000 |
2 | 3 | Patricia | 130000 |
The type of join is specified with the parameter how
. When performing an inner join, as seen above, we don’t need to specify this parameter as it’s the default.
inner
uses the intersection of both dataframes. In our case, we can see that:
- James has
id
4, butid
4 is not in the salary dataframe. - From the salary dataframe, there is an employee with
id
5, but there is noid
5 in the employees dataframe
The resulting dataset from an inner
join doesn’t has James nor employee with id
5. This introduces us to the next type of join: the outer
join.
Outer Join
The outer
join uses the union from both dataframes. Remeber that the parameter for specifying the type of join is how
.
employees_with_salary_outer = employees.merge(salary, left_on='id', right_on='id', how='outer')
employees_with_salary_outer
id | name | salary | |
---|---|---|---|
0 | 1 | Johana | 120000.0 |
1 | 2 | Mike | 100000.0 |
2 | 3 | Patricia | 130000.0 |
3 | 4 | James | NaN |
4 | 5 | NaN | 90000.0 |
We see that the employee wit id
4 is included. Since this name is not in the employees
dataframe, a NaN
value is instead of his name.
Left Join
The left join uses the keys from the left frame. In our case, James will still be on the merged dataframe:
employees_with_salary_left = employees.merge(salary, left_on='id', right_on='id', how='left')
employees_with_salary_left
id | name | salary | |
---|---|---|---|
0 | 1 | Johana | 120000.0 |
1 | 2 | Mike | 100000.0 |
2 | 3 | Patricia | 130000.0 |
3 | 4 | James | NaN |
Since our right dataframe has an employee with id 5
but there is no employee with id 5
in the left dataframe. This employee 5
will be left out from the left join resulting dataframe:
Right Join
And if we use the how='right'
parameter, James will be left out (id
4 is not present on the salary
dataframe); but the employee whose id
is 5 will be in the resulting dataframe.
employees_with_salary_right = employees.merge(salary, left_on='id', right_on='id', how='right')
employees_with_salary_right
id | name | salary | |
---|---|---|---|
0 | 1 | Johana | 120000 |
1 | 2 | Mike | 100000 |
2 | 3 | Patricia | 130000 |
3 | 5 | NaN | 90000 |
The left dataframe has an employee with id
4, but there is no employee with id
4 in the right dataframe. This employee will be left out from the resulting right join dataframe: