Pandas: Quantile

In this Jupyter notebook we will analyze Fortune 500 companies and use the pandas quantile function to find the top companies ranked by profit.

We will use Seaborn for visualizations.

The objective is to reproduce with pandas the result we previously obtained with SQL. The SQL version of this analysis is in my previous notebook, SQL Temporary Tables.

import pandas as pd
import getpass  # for password input
import matplotlib as mpl  # for visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# magic command to load the ipython-sql extension. We can connect to any database which is supported by SQLAlchemy. 
%load_ext sql
# input sudo password
password = getpass.getpass()
# start the local PostgreSQL server
command = "/etc/init.d/postgresql start"  # shell command to run
!echo {password}|sudo -S {command} # -S makes sudo read the password from stdin
# create a connection
postgresql_pass = getpass.getpass()
%sql postgresql://fede:{postgresql_pass}@localhost/datacamp

 ········
[sudo] password for fede: Starting postgresql (via systemctl): postgresql.service.
 ········

'Connected: fede@datacamp'

Load the table as a pandas DataFrame. The same analysis can be done in SQL alone, but the purpose of this notebook is to use pandas instead.

f = %sql select * from fortune500
f = f.DataFrame()
f.head(5)

 * postgresql://fede:***@localhost/datacamp
500 rows affected.

	rank	title	name	ticker	url	hq	sector	industry	employees	revenues	revenues_change	profits	profits_change	assets	equity
0	1	Walmart	Wal-Mart Stores, Inc.	WMT	http://www.walmart.com	Bentonville, AR	Retailing	General Merchandisers	2300000	485873.0	0.8	13643	-7.2	198825	77798
1	2	Berkshire Hathaway	Berkshire Hathaway Inc.	BRKA	http://www.berkshirehathaway.com	Omaha, NE	Financials	Insurance: Property and Casualty (Stock)	367700	223604.0	6.1	24074	0.0	620854	283001
2	3	Apple	Apple, Inc.	AAPL	http://www.apple.com	Cupertino, CA	Technology	Computers, Office Equipment	116000	215639.0	-7.7	45687	-14.4	321686	128249
3	4	Exxon Mobil	Exxon Mobil Corporation	XOM	http://www.exxonmobil.com	Irving, TX	Energy	Petroleum Refining	72700	205004.0	-16.7	7840	-51.5	330314	167325
4	5	McKesson	McKesson Corporation	MCK	http://www.mckesson.com	San Francisco, CA	Wholesalers	Wholesalers: Health Care	68000	192487.0	6.2	2258	53.0	56563	8924
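The %sql magic is one way to get the table into pandas. As a rough sketch of an alternative, pd.read_sql can run the query and return a DataFrame in one step. The example below uses an in-memory SQLite database and a two-row stand-in table so that it runs anywhere; the SQLite connection and the reduced set of columns are assumptions for illustration, not the notebook's PostgreSQL setup.

```python
import sqlite3

import pandas as pd

# Stand-in database: an in-memory SQLite table with two rows taken from the output above.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE fortune500 ("rank" INTEGER, title TEXT, sector TEXT, profits REAL)'
)
conn.executemany(
    "INSERT INTO fortune500 VALUES (?, ?, ?, ?)",
    [(1, "Walmart", "Retailing", 13643), (3, "Apple", "Technology", 45687)],
)

# read_sql runs the query and returns the result set as a DataFrame.
sample = pd.read_sql("SELECT * FROM fortune500", conn)
conn.close()
print(sample)
```

With a real PostgreSQL server, the second argument would be a SQLAlchemy engine or connection instead of the SQLite connection used here.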

The SQL function for computing a percentile is percentile_cont(fractions) WITHIN GROUP (ORDER BY sort_expression).

In pandas, the equivalent function is pandas.DataFrame.quantile.

help(pd.DataFrame.quantile)

Help on function quantile in module pandas.core.frame:

quantile(self, q=0.5, axis=0, numeric_only=True, interpolation='linear')
    Return values at the given quantile over requested axis.
    
    Parameters
    ----------
    q : float or array-like, default 0.5 (50% quantile)
        Value between 0 <= q <= 1, the quantile(s) to compute.
    axis : {0, 1, 'index', 'columns'} (default 0)
        Equals 0 or 'index' for row-wise, 1 or 'columns' for column-wise.
    numeric_only : bool, default True
        If False, the quantile of datetime and timedelta data will be
        computed as well.
    interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
        This optional parameter specifies the interpolation method to use,
        when the desired quantile lies between two data points `i` and `j`:
    
        * linear: `i + (j - i) * fraction`, where `fraction` is the
          fractional part of the index surrounded by `i` and `j`.
        * lower: `i`.
        * higher: `j`.
        * nearest: `i` or `j` whichever is nearest.
        * midpoint: (`i` + `j`) / 2.
    
    Returns
    -------
    Series or DataFrame
    
        If ``q`` is an array, a DataFrame will be returned where the
          index is ``q``, the columns are the columns of self, and the
          values are the quantiles.
        If ``q`` is a float, a Series will be returned where the
          index is the columns of self and the values are the quantiles.
    
    See Also
    --------
    core.window.Rolling.quantile: Rolling quantile.
    numpy.percentile: Numpy function to compute the percentile.
    
    Examples
    --------
    >>> df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
    ...                   columns=['a', 'b'])
    >>> df.quantile(.1)
    a    1.3
    b    3.7
    Name: 0.1, dtype: float64
    >>> df.quantile([.1, .5])
           a     b
    0.1  1.3   3.7
    0.5  2.5  55.0
    
    Specifying `numeric_only=False` will also compute the quantile of
    datetime and timedelta data.
    
    >>> df = pd.DataFrame({'A': [1, 2],
    ...                    'B': [pd.Timestamp('2010'),
    ...                          pd.Timestamp('2011')],
    ...                    'C': [pd.Timedelta('1 days'),
    ...                          pd.Timedelta('2 days')]})
    >>> df.quantile(0.5, numeric_only=False)
    A                    1.5
    B    2010-07-02 12:00:00
    C        1 days 12:00:00
    Name: 0.5, dtype: object
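
To make the interpolation parameter from the help text concrete, here is a small sketch on a made-up four-value Series (the data is an assumption for illustration). PostgreSQL's percentile_cont interpolates linearly between adjacent values, so the default 'linear' setting is the one that should match the SQL result.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# q=0.4 falls at position 0.4 * (4 - 1) = 1.2, i.e. between the values 2 and 3.
results = {
    method: s.quantile(0.4, interpolation=method)
    for method in ["linear", "lower", "higher", "midpoint", "nearest"]
}

for method, value in results.items():
    print(f"{method:>8}: {value}")
```

Only 'linear' uses the fractional part of the position (2 + 0.2 * (3 - 2)); the other methods pick or average the two neighbouring values.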

f[['sector', 'profits']].dtypes 
# f[['sector', 'profits']].groupby('sector').quantile(.80)

sector     object
profits    object
dtype: object

The profits column has dtype object, so we need to convert it to a numeric type. I'll use pd.to_numeric:

f['profits'] = pd.to_numeric(f.profits)
f[['sector', 'profits']].dtypes

sector      object
profits    float64
dtype: object
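
As a minimal sketch of what pd.to_numeric does with messy input: the example below assumes the values arrive as strings, with one missing entry and one unparseable entry (both made up for illustration). With errors="coerce", anything that cannot be parsed becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings, plus a missing and a bad value.
raw = pd.Series(["13643", "24074", None, "n/a"], dtype="object")

# errors="coerce" turns unparseable entries into NaN; the result is float64.
clean = pd.to_numeric(raw, errors="coerce")
print(clean.dtype)
print(clean)
```

Rows that end up as NaN are worth counting (clean.isna().sum()) before moving on, because they silently drop out of later comparisons.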

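We are interested in the 80th percentile of profits per sector. Note that the bare %time line in this cell and in the filtering cell further down times an empty statement, so the timings printed below do not measure the pandas operations; %%time as the first line of a cell would time the whole cell.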

%time
# Find the 80th percentile of profits per sector and store it in a pd.DataFrame
percentiles_80_per_sector = f[['sector', 'profits']].groupby('sector').quantile(.8)
# The resulting column is still called 'profits'; rename it to something more meaningful
percentiles_80_per_sector.rename(columns={'profits': 'percentile80'}, inplace=True)
percentiles_80_per_sector

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.63 µs

	percentile80
sector
Aerospace & Defense	4507.00
Apparel	1611.28
Business Services	1419.30
Chemicals	1401.60
Energy	1311.00
Engineering & Construction	555.50
Financials	2801.00
Food & Drug Stores	2015.56
Food, Beverages & Tobacco	4608.40
Health Care	4761.80
Hotels, Restaurants & Leisure	1899.54
Household Products	1955.72
Industrials	2699.00
Materials	490.24
Media	2755.00
Motor Vehicles & Parts	2596.80
Retailing	1214.02
Technology	6641.60
Telecommunications	9551.20
Transportation	2593.40
Wholesalers	556.04
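
The per-group quantile is easy to check by hand on a toy table. The sketch below uses two made-up sectors (the data and the names toy and p80 are assumptions for illustration) and selects the profits column before calling quantile, which returns a Series that can be renamed in the same expression.

```python
import pandas as pd

toy = pd.DataFrame({
    "sector": ["A"] * 5 + ["B"] * 3,
    "profits": [10, 20, 30, 40, 50, 1, 2, 3],
})

# Sector A: position 0.8 * (5 - 1) = 3.2 -> 40 + 0.2 * (50 - 40) = 42.0
# Sector B: position 0.8 * (3 - 1) = 1.6 -> 2 + 0.6 * (3 - 2)    = 2.6
p80 = toy.groupby("sector")["profits"].quantile(0.8).rename("percentile80")
print(p80)
```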

Now, let's merge this with the DataFrame f, so that every company row carries the percentile80 of its sector:

f = pd.merge(f, percentiles_80_per_sector, left_on='sector', right_index=True)
f.head(5)

	rank	title	name	ticker	url	hq	sector	industry	employees	revenues	revenues_change	profits	profits_change	assets	equity	percentile80
0	1	Walmart	Wal-Mart Stores, Inc.	WMT	http://www.walmart.com	Bentonville, AR	Retailing	General Merchandisers	2300000	485873.0	0.8	13643.0	-7.2	198825	77798	1214.02
15	16	Costco	Costco Wholesale Corporation	COST	http://www.costco.com	Issaquah, WA	Retailing	General Merchandisers	172000	118719.0	2.2	2350.0	-1.1	33163	12079	1214.02
22	23	Home Depot	The Home Depot, Inc.	HD	http://www.homedepot.com	Atlanta, GA	Retailing	Specialty Retailers: Other	406000	94595.0	6.9	7957.0	13.5	42966	4333	1214.02
37	38	Target	Target Corporation	TGT	http://www.target.com	Minneapolis, MN	Retailing	General Merchandisers	323000	69495.0	-5.8	2737.0	-18.6	37431	10953	1214.02
39	40	Lowe’s	Lowe's Companies, Inc.	LOW	http://www.lowes.com	Mooresville, NC	Retailing	Specialty Retailers: Other	240000	65017.0	10.1	3093.0	21.5	34408	6434	1214.02
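
The merge above matches the sector column on the left against the index on the right. As a hedged sketch of an alternative, groupby(...).transform can attach the per-sector value to every row without a merge and without changing the row order. The toy data and variable names below are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "sector": ["A", "B", "A", "B", "A", "B", "A", "A"],
    "profits": [10, 1, 20, 2, 30, 3, 40, 50],
})

# Approach used in this notebook: compute per-sector quantiles, then merge them back.
p80 = toy.groupby("sector")["profits"].quantile(0.8).rename("percentile80")
merged = toy.merge(p80, left_on="sector", right_index=True)

# Alternative: transform broadcasts each group's quantile back onto its rows.
transformed = toy.assign(
    percentile80=toy.groupby("sector")["profits"].transform(lambda s: s.quantile(0.8))
)

print(transformed)
```

Both approaches give the same percentile80 value per row; transform simply skips the intermediate table.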

Now, filter the companies based on the percentile. The objective is to keep only those companies whose profits are greater than or equal to percentile80:

%time
# boolean array: True where a company's profits reach its sector's 80th percentile
filter = f[['profits']].values >= f[['percentile80']].values
filter[:10]

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 7.87 µs


/home/fede/anaconda3/envs/fede/lib/python3.7/site-packages/ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in greater_equal
  This is separate from the ipykernel package so we can avoid doing imports until





array([[ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [False],
       [False],
       [ True]])

# Sanity check: roughly the top 20% of the 500 companies should remain
print(len(f))
print(len(f[filter]))

500
107
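
The same filter can be written as a one-dimensional boolean Series by comparing the two columns directly. The sketch below uses made-up rows and the name mask (rather than filter, which shadows a Python built-in); both are assumptions for illustration. The RuntimeWarning above is most likely caused by missing values in profits: a NaN never satisfies >=, so such rows are excluded either way.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "profits": [100.0, 50.0, np.nan, 80.0],
    "percentile80": [80.0, 80.0, 80.0, 80.0],
})

# Column-to-column comparison yields a 1-D boolean Series aligned on the index.
mask = toy["profits"] >= toy["percentile80"]

# Rows with NaN profits compare as False and are dropped by the filter.
top = toy[mask]
print(top)
```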

df = f[filter]
df.head(5)

	rank	title	name	ticker	url	hq	sector	industry	employees	revenues	revenues_change	profits	profits_change	assets	equity	percentile80
0	1	Walmart	Wal-Mart Stores, Inc.	WMT	http://www.walmart.com	Bentonville, AR	Retailing	General Merchandisers	2300000	485873.0	0.8	13643.0	-7.2	198825	77798	1214.02
15	16	Costco	Costco Wholesale Corporation	COST	http://www.costco.com	Issaquah, WA	Retailing	General Merchandisers	172000	118719.0	2.2	2350.0	-1.1	33163	12079	1214.02
22	23	Home Depot	The Home Depot, Inc.	HD	http://www.homedepot.com	Atlanta, GA	Retailing	Specialty Retailers: Other	406000	94595.0	6.9	7957.0	13.5	42966	4333	1214.02
37	38	Target	Target Corporation	TGT	http://www.target.com	Minneapolis, MN	Retailing	General Merchandisers	323000	69495.0	-5.8	2737.0	-18.6	37431	10953	1214.02
39	40	Lowe’s	Lowe's Companies, Inc.	LOW	http://www.lowes.com	Mooresville, NC	Retailing	Specialty Retailers: Other	240000	65017.0	10.1	3093.0	21.5	34408	6434	1214.02

plt.figure(figsize=(5, 6))
ax = sns.scatterplot(data=df, x="profits", y="sector", s=300, hue="profits")

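[Figure: scatter plot of profits (x) by sector (y) for the filtered companies, with point color by profits]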

sectors = list(percentiles_80_per_sector.index)
sectors

['Aerospace & Defense',
 'Apparel',
 'Business Services',
 'Chemicals',
 'Energy',
 'Engineering & Construction',
 'Financials',
 'Food & Drug Stores',
 'Food, Beverages & Tobacco',
 'Health Care',
 'Hotels, Restaurants & Leisure',
 'Household Products',
 'Industrials',
 'Materials',
 'Media',
 'Motor Vehicles & Parts',
 'Retailing',
 'Technology',
 'Telecommunications',
 'Transportation',
 'Wholesalers']

fig, ax = plt.subplots(figsize=(10, 20))
ax.scatter(df.profits, df.sector, c="blue")

# Label the first company listed in each sector.
for s in sectors:
    first = df[df["sector"] == s].iloc[0]
    # Use the sector name itself as the y coordinate: matplotlib orders
    # categories by first appearance in df.sector, not alphabetically,
    # so sectors.index(s) would not line up with the plotted rows.
    ax.text(first["profits"], s, first["title"], rotation=45)
plt.show()

[Figure: scatter plot of profits by sector, with one company name labeled per sector]
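
The loop above labels the first row of each sector. If the goal were instead to label the most profitable company per sector, one possible sketch is to pick those rows with groupby and idxmax. This is an alternative selection rule, not the one used in this notebook; the toy frame below reuses four rows from the table above.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Walmart", "Costco", "Home Depot", "Apple"],
    "sector": ["Retailing", "Retailing", "Retailing", "Technology"],
    "profits": [13643.0, 2350.0, 7957.0, 45687.0],
})

# idxmax returns, for each sector, the index label of the row with the highest profits.
top_rows = toy.groupby("sector")["profits"].idxmax()
top_per_sector = toy.loc[top_rows, ["sector", "title", "profits"]]
print(top_per_sector)
```

The resulting rows could then be passed to ax.text in the same way as in the loop above.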