Pandas: Merge
Pandas provides various methods for combining dataframes. Most important are:
-
merge concat
In many situations, data comes from several sources, and we need to combine this data to perform analyses.
For instance, we have a dataset with employees’ id, names and adresses; and another dataset with employees’ salary.
In this notebook, we will see how to use pandas merge method.
The basics types of join are:
- Left Join
- Right Join
- Outer Join
- Inner Join
import pandas as pd
# Our employees' id and names
employees = pd.DataFrame({'id':[1,2,3,4],'name':['Johana','Mike','Patricia','James']})
employees
| id | name | |
|---|---|---|
| 0 | 1 | Johana |
| 1 | 2 | Mike |
| 2 | 3 | Patricia |
| 3 | 4 | James |
# Our employees' id and salary
salary = pd.DataFrame({'id':[1,2,3,5],'salary':[120000,100000,130000,90000]})
salary
| id | salary | |
|---|---|---|
| 0 | 1 | 120000 |
| 1 | 2 | 100000 |
| 2 | 3 | 130000 |
| 3 | 5 | 90000 |
Inner Join
The inner join is the most common type of join and the default for the
pandas.DataFrame.merge method
employees_with_salary = employees.merge(salary,
left_on="id",
right_on="id",
how='inner') # default
employees_with_salary
| id | name | salary | |
|---|---|---|---|
| 0 | 1 | Johana | 120000 |
| 1 | 2 | Mike | 100000 |
| 2 | 3 | Patricia | 130000 |
The type of join is specified with the parameter how. When performing an inner join, as seen above, we don’t need to specify this parameter as it’s the default.
inner uses the intersection of both dataframes. In our case, we can see that:
- James has
id4, butid4 is not in the salary dataframe. - From the salary dataframe, there is an employee with
id5, but there is noid5 in the employees dataframe
The resulting dataset from an inner join doesn’t has James nor employee with id 5. This introduces us to the next type of join: the outer join.

Outer Join
The outer join uses the union from both dataframes. Remeber that the parameter for specifying the type of join is how.
employees_with_salary_outer = employees.merge(salary, left_on='id', right_on='id', how='outer')
employees_with_salary_outer
| id | name | salary | |
|---|---|---|---|
| 0 | 1 | Johana | 120000.0 |
| 1 | 2 | Mike | 100000.0 |
| 2 | 3 | Patricia | 130000.0 |
| 3 | 4 | James | NaN |
| 4 | 5 | NaN | 90000.0 |
We see that the employee wit id 4 is included. Since this name is not in the employees dataframe, a NaN value is instead of his name.

Left Join
The left join uses the keys from the left frame. In our case, James will still be on the merged dataframe:
employees_with_salary_left = employees.merge(salary, left_on='id', right_on='id', how='left')
employees_with_salary_left
| id | name | salary | |
|---|---|---|---|
| 0 | 1 | Johana | 120000.0 |
| 1 | 2 | Mike | 100000.0 |
| 2 | 3 | Patricia | 130000.0 |
| 3 | 4 | James | NaN |
Since our right dataframe has an employee with id 5 but there is no employee with id 5 in the left dataframe. This employee 5 will be left out from the left join resulting dataframe:

Right Join
And if we use the how='right' parameter, James will be left out (id 4 is not present on the salary dataframe); but the employee whose id is 5 will be in the resulting dataframe.
employees_with_salary_right = employees.merge(salary, left_on='id', right_on='id', how='right')
employees_with_salary_right
| id | name | salary | |
|---|---|---|---|
| 0 | 1 | Johana | 120000 |
| 1 | 2 | Mike | 100000 |
| 2 | 3 | Patricia | 130000 |
| 3 | 5 | NaN | 90000 |
The left dataframe has an employee with id 4, but there is no employee with id 4 in the right dataframe. This employee will be left out from the resulting right join dataframe:
