SQL Temporary Tables

Definition: A temporary table is a table that persists only for the duration of a session. When the connection is closed, the temporary table is dropped.

In this notebook, we will explore and visualize data from the Fortune 500 companies and find the best-performing companies, sector-wise, using PostgreSQL temporary tables.

In PostgreSQL, a temporary table is created by specifying TEMPORARY or TEMP in the CREATE TABLE AS command.

Example:

CREATE TEMP TABLE TOP_COMPANIES AS (
    SELECT *
    FROM FORTUNE500
    WHERE PROFITS > 0
    ORDER BY PROFITS DESC
    LIMIT 10
);

SELECT *
FROM TOP_COMPANIES
ORDER BY PROFITS DESC;

Result:

| rank | title | name | ticker | url | hq | sector | industry | employees | revenues | revenues_change | profits | profits_change | assets | equity |
|------|-------|------|--------|-----|----|--------|----------|-----------|----------|-----------------|---------|----------------|--------|--------|
| 3 | Apple | Apple, Inc. | AAPL | http://www.apple.com | Cupertino, CA | Technology | Computers, Office Equipment | 116000 | 215639 | -7.7 | 45687 | -14.4 | 321686 | 128249 |
| 21 | J.P. Morgan Chase | JPMorgan Chase & Co. | JPM | http://www.jpmorganchase.com | New York, NY | Financials | Commercial Banks | 243355 | 105486 | 4.4 | 24733 | 1.2 | 2490972 | 254190 |
| 2 | Berkshire Hathaway | Berkshire Hathaway Inc. | BRKA | http://www.berkshirehathaway.com | Omaha, NE | Financials | Insurance: Property and Casualty (Stock) | 367700 | 223604 | 6.1 | 24074 | 0 | 620854 | 283001 |
| 25 | Wells Fargo | Wells Fargo & Company | WFC | http://www.wellsfargo.com | San Francisco, CA | Financials | Commercial Banks | 269100 | 94176 | 4.6 | 21938 | -4.2 | 1930115 | 199581 |
| 27 | Alphabet | Alphabet Inc. | GOOGL | http://www.abc.xyz | Mountain View, CA | Technology | Internet Services and Retailing | 72053 | 90272 | 20.4 | 19478 | 19.1 | 167497 | 139036 |
| 26 | Bank of America Corp. | Bank of America Corporation | BAC | http://www.bankofamerica.com | Charlotte, NC | Financials | Commercial Banks | 208024 | 93662 | 0.7 | 17906 | 12.7 | 2187702 | 266840 |
| 28 | Microsoft | Microsoft Corporation | MSFT | http://www.microsoft.com | Redmond, WA | Technology | Computer Software | 114000 | 85320 | -8.8 | 16798 | 37.8 | 193694 | 71997 |
| 35 | Johnson & Johnson | Johnson & Johnson | JNJ | http://www.jnj.com | New Brunswick, NJ | Health Care | Pharmaceuticals | 126400 | 71890 | 2.6 | 16540 | 7.3 | 141208 | 70418 |
| 30 | Citigroup | Citigroup Inc. | C | http://www.citigroup.com | New York, NY | Financials | Commercial Banks | 219000 | 82386 | -6.7 | 14912 | -13.5 | 1792077 | 225120 |
| 148 | Altria Group | Altria Group, Inc. | MO | http://www.altria.com | Richmond, VA | Food, Beverages & Tobacco | Tobacco | 8300 | 19337 | 2.6 | 14239 | 171.7 | 45932 | 12770 |

Temporary Tables vs. Common Table Expressions

Temporary tables are very similar to common table expressions in syntax and functionality.

From the PostgreSQL docs:

WITH provides a way to write auxiliary statements for use in a larger query. These statements, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for one query.

Example:

WITH TOP_COMPANIES_CTE AS (
    SELECT *
    FROM FORTUNE500
    WHERE PROFITS > 0
    ORDER BY PROFITS DESC
    LIMIT 10
)
SELECT *
FROM TOP_COMPANIES_CTE;

Result:

(The output is identical to the temporary-table result above.)

As we can see, the same result is achieved using a CTE (WITH) instead of a TEMP table.

Differences between temporary tables and common table expressions:

  • WITH statements are scoped to the query, whereas temporary tables persist for the session.
  • WITH can be used for other things, like Recursive Query Evaluation and creating data-modifying statements.
  • Temporary tables allow adding indexes.
  • Temporary tables are slightly less efficient.

This means that we usually should choose to use a CTE instead of a temporary table, unless we will benefit from an index.
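The index trade-off can be sketched end-to-end. The snippet below uses SQLite (from the Python standard library) as a stand-in for PostgreSQL, with a made-up mini `fortune500` table: the temporary table persists for the session and accepts an index, which a CTE cannot.

```python
import sqlite3

# Sketch of the temp-table-vs-CTE trade-off, using SQLite as a stand-in
# for PostgreSQL; table name and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fortune500 (title TEXT, sector TEXT, profits REAL)")
conn.executemany(
    "INSERT INTO fortune500 VALUES (?, ?, ?)",
    [("Apple", "Technology", 45687),
     ("Microsoft", "Technology", 16798),
     ("Citigroup", "Financials", 14912)],
)

# The temporary table persists for the whole session...
conn.execute(
    "CREATE TEMP TABLE top_companies AS "
    "SELECT * FROM fortune500 WHERE profits > 20000"
)
# ...so we can add an index to it -- something a CTE does not allow.
conn.execute("CREATE INDEX idx_top_profits ON top_companies (profits)")

rows = conn.execute("SELECT title FROM top_companies").fetchall()
print(rows)  # [('Apple',)]
conn.close()
```

In PostgreSQL the equivalent would be `CREATE INDEX ... ON top_companies (...)` right after the `CREATE TEMP TABLE ... AS` statement.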

Temporary Tables vs. Common Table Expressions vs. Subqueries

Another alternative is to use subqueries, as in the following example:

SELECT *
FROM (
    SELECT *
    FROM FORTUNE500
    WHERE PROFITS > 0
    ORDER BY PROFITS DESC
    LIMIT 10
) AS TOP_COMPANIES_SUBQUERY;

The result is the same, and in many simple cases all three approaches behave very similarly. Since we should use temporary tables instead of CTEs only when we benefit from an index, we will focus on the differences between CTEs and subqueries:

CTE vs Subqueries:

  • If the same subquery will be used multiple times, it is preferable to use a CTE to take advantage of encapsulation: we declare the subquery only once and reference it wherever it’s needed. This doesn’t mean that the CTE will be executed only once; in many cases, even though the CTE is declared once, the query optimizer might execute it many times.

  • A CTE can be used recursively; a sub-query cannot.
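The recursion point deserves a concrete example. Here is a minimal recursive CTE, run through SQLite for portability (PostgreSQL uses the same `WITH RECURSIVE` syntax); it sums the integers 1 through 10, something a plain subquery cannot express:

```python
import sqlite3

# A recursive CTE: the CTE references itself in the UNION ALL branch.
conn = sqlite3.connect(":memory:")
total = conn.execute("""
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 10
    )
    SELECT sum(n) FROM counter
""").fetchone()[0]
print(total)  # 55
conn.close()
```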

CTEs in PostgreSQL

Unlike other DBMSs, such as MS SQL Server or Oracle, PostgreSQL (before version 12, which introduced CTE inlining) evaluates the query inside the CTE and stores the result (see https://medium.com/@hakibenita/be-careful-with-cte-in-postgresql-fca5e24d2119). By materializing the CTE in this way, some CTEs in PostgreSQL can incur a performance hit.

From the docs:

A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work. Another possible application is to prevent unwanted multiple evaluations of functions with side-effects. However, the other side of this coin is that the optimizer is less able to push restrictions from the parent query down into a WITH query than an ordinary subquery. The WITH query will generally be evaluated as written, without suppression of rows that the parent query might discard afterwards. (But, as mentioned above, evaluation might stop early if the reference(s) to the query demand only a limited number of rows.)

Different Implementations of Temporary Tables across RDBMSs

  • In MS SQL Server, temporary tables are created in tempdb and stored on disk. MS SQL Server also implements an alternative to temporary tables, called table variables.
  • PostgreSQL uses a set of temporary schemas. Temporary tables are kept in memory and overflow to disk once the temp_buffers parameter in postgresql.conf is exceeded.
  • Oracle uses temp files (temporary tablespaces).
  • MySQL stores temporary tables in memory. Lots of temporary tables can eat up a lot of memory and cause other performance problems.
  • DB2 uses a temporary table space. It may use memory and spill to disk if the data is too large.

Objective

In this notebook, I’ll show how to:

  • Use SQL temporary tables.
  • Use the ipython-sql extension to run SQL commands from a Jupyter notebook and get data from a database.
  • Store the output of SQL commands in Pandas dataframes.
  • Create a visualization of the data with Seaborn.
# import required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import getpass  # for password input
from typing import List  # optional type annotations
# magic command to load the ipython-sql extension. We can connect to any database which is supported by SQLAlchemy. 
%load_ext sql
# input sudo password
password = getpass.getpass()
 ········
# start the local postgres server
command = "/etc/init.d/postgresql start"  # shell command to run; sudo -S reads the password from stdin
!echo {password}|sudo -S {command}  # run the command using the sudo password
[sudo] password for fede: Starting postgresql (via systemctl): postgresql.service.

If you see Starting postgresql (via systemctl): postgresql.service., postgresql has been successfully started.

# create a connection
postgresql_pass = getpass.getpass()
%sql postgresql://fede:{postgresql_pass}@localhost/datacamp
print("DON'T FORGET TO CLOSE THE CONNECTION")
 ········


DON'T FORGET TO CLOSE THE CONNECTION

If you see Connected: username@databasename, you successfully created a connection to the database. Don't forget to close the connection when you finish using it.

Let’s print the first rows from table fortune500. This table contains information from fortune500 companies.

%%sql
select * from fortune500
limit 5
 * postgresql://fede:***@localhost/datacamp
5 rows affected.
| rank | title | name | ticker | url | hq | sector | industry | employees | revenues | revenues_change | profits | profits_change | assets | equity |
|------|-------|------|--------|-----|----|--------|----------|-----------|----------|-----------------|---------|----------------|--------|--------|
| 1 | Walmart | Wal-Mart Stores, Inc. | WMT | http://www.walmart.com | Bentonville, AR | Retailing | General Merchandisers | 2300000 | 485873 | 0.8 | 13643 | -7.2 | 198825 | 77798 |
| 2 | Berkshire Hathaway | Berkshire Hathaway Inc. | BRKA | http://www.berkshirehathaway.com | Omaha, NE | Financials | Insurance: Property and Casualty (Stock) | 367700 | 223604 | 6.1 | 24074 | 0.0 | 620854 | 283001 |
| 3 | Apple | Apple, Inc. | AAPL | http://www.apple.com | Cupertino, CA | Technology | Computers, Office Equipment | 116000 | 215639 | -7.7 | 45687 | -14.4 | 321686 | 128249 |
| 4 | Exxon Mobil | Exxon Mobil Corporation | XOM | http://www.exxonmobil.com | Irving, TX | Energy | Petroleum Refining | 72700 | 205004 | -16.7 | 7840 | -51.5 | 330314 | 167325 |
| 5 | McKesson | McKesson Corporation | MCK | http://www.mckesson.com | San Francisco, CA | Wholesalers | Wholesalers: Health Care | 68000 | 192487 | 6.2 | 2258 | 53.0 | 56563 | 8924 |

Besides showing how to use SQL temporary tables, the purpose of this notebook is to find those companies that have profits larger than the 80th percentile, per sector.

The first step is to create a temporary table that will hold the 80th percentile per sector.

Percentiles

The SQL function to compute a percentile is percentile_disc ('disc' stands for discrete), which always returns an actual value from the column. Another option is percentile_cont, which computes the percentile assuming a continuous distribution of values, interpolating between them when necessary.
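The difference between the two functions can be sketched in plain Python. The helpers below are hand-rolled approximations of the SQL semantics, written for a small illustrative list of profits:

```python
import math

def percentile_disc(values, p):
    """Discrete percentile: the smallest value whose cumulative
    fraction of the ordered set is >= p (mirrors SQL percentile_disc)."""
    ordered = sorted(values)
    idx = math.ceil(p * len(ordered))  # 1-based rank of the first qualifying row
    return ordered[idx - 1]

def percentile_cont(values, p):
    """Continuous percentile: linear interpolation between the two
    nearest values (mirrors SQL percentile_cont)."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)       # 0-based fractional position
    lo, frac = int(pos), pos - int(pos)
    if frac == 0:
        return ordered[lo]
    return ordered[lo] + frac * (ordered[lo + 1] - ordered[lo])

profits = [100, 200, 300, 400, 500]
print(percentile_disc(profits, 0.8))  # 400 -> an actual value from the list
print(percentile_cont(profits, 0.8))  # 420.0 -> interpolated
```

Note how the discrete version picks 400, a real member of the list, while the continuous version interpolates to 420.0.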

%%sql 
create temporary table percentiles as
select sector, percentile_disc(0.8) within group(order by profits) perc80
from fortune500
group by sector;
 * postgresql://fede:***@localhost/datacamp
21 rows affected.

%%sql
-- check the 80th percentile per sector
select *
from percentiles 
order by perc80 desc;
 * postgresql://fede:***@localhost/datacamp
21 rows affected.
| sector | perc80 |
|--------|--------|
| Telecommunications | 8695 |
| Technology | 7266 |
| Food, Beverages & Tobacco | 6073 |
| Health Care | 4965 |
| Aerospace & Defense | 4895 |
| Motor Vehicles & Parts | 4596 |
| Industrials | 3695 |
| Financials | 3014 |
| Media | 2755 |
| Transportation | 2676 |
| Household Products | 2166 |
| Food & Drug Stores | 2025.7 |
| Hotels, Restaurants & Leisure | 1670 |
| Chemicals | 1500 |
| Business Services | 1401 |
| Energy | 1311 |
| Retailing | 1228 |
| Apparel | 1074.1 |
| Wholesalers | 605.9 |
| Engineering & Construction | 602.7 |
| Materials | 496 |
# store the values of this SQL temporary table in an ipython-sql sql.run.ResultSet
# I'll convert this result set to a pandas dataframe to visualize it.
perc80 = %sql select * from percentiles
 * postgresql://fede:***@localhost/datacamp
21 rows affected.
# convert the sql result set to a dataframe
df_perc80: pd.DataFrame = perc80.DataFrame() 
# sanity check
df_perc80.head()

sector perc80
0 Aerospace & Defense 4895
1 Apparel 1074.1
2 Business Services 1401
3 Chemicals 1500
4 Energy 1311
# I sort the values in descending order, to create the visualization
df_perc80.sort_values(by='perc80', ascending=False, inplace=True)

Now, I’ll create a visualization using Seaborn. I’ll use a barplot.

# create a barplot
sns.set(style="darkgrid")
sns.set_palette("Blues")

plt.figure(figsize=(15,10))
ax = sns.barplot(data=df_perc80, y='sector', x='perc80', palette="Blues_r")  # switch x and y for a horizontal barplot
ax.set_xlabel('80th percentile of profits', fontsize=10)
ax.set_title('80th Percentile per Sector',fontsize= 30) # title of plot
Text(0.5, 1.0, '80th Percentile per Sector')

png

We can see that there is a big difference between 80th percentiles per sector. Telcos, Techs and Food companies have large 80th percentiles of profits.
Companies in the Wholesalers, Engineering & Construction and Materials sectors have much lower 80th percentile of profits.

Companies with large profits, per sector

Now, I want to find those companies that have profits above the 80th percentile of their sector. Afterward, I’ll create a visualization of this data.

The first step will be to create a list of companies whose profits are above the 80th percentile of the corresponding sector. I’ll need a sector column.

Also, I’ll join this data with the previously created temporary table, to have a column with the 80th percentile of profits (column perc80)

I’ll also create a ratio column to compare the profits of each company with the 80th percentile. Of course, since I’m selecting only companies whose profits are at or above the 80th percentile, this ratio will always be at least 100.
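The same join-and-filter logic can be sketched in pandas on a tiny made-up frame (the column names here mirror the SQL; `interpolation="higher"` approximates percentile_disc for this toy data, though the two are not identical in general):

```python
import pandas as pd

# Toy frame standing in for fortune500
df = pd.DataFrame({
    "title":   ["A", "B", "C", "D", "E", "F"],
    "sector":  ["Tech", "Tech", "Tech", "Fin", "Fin", "Fin"],
    "profits": [10.0, 50.0, 100.0, 5.0, 20.0, 40.0],
})

# per-sector 80th percentile, using an actual value from the column
perc80 = df.groupby("sector")["profits"].quantile(0.8, interpolation="higher")

# join back on sector, keep companies at or above their sector's perc80
merged = df.merge(perc80.rename("perc80"), left_on="sector", right_index=True)
top = merged[merged["profits"] >= merged["perc80"]].copy()
top["ratio"] = (top["profits"] / top["perc80"] * 100).round(0)
print(top[["title", "sector", "ratio"]])
```

This mirrors the SQL below: the merge plays the role of the left join on percentiles, and the boolean filter plays the role of the WHERE clause.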

%%sql
-- now create a list of companies above 80th percentile, per sector
select title, f.sector, profits, perc80, round(f.profits / p.perc80 * 100, 0) ratio
from fortune500 f
left join percentiles p -- join with the previously created temporary table
on f.sector = p.sector
where f.profits >= p.perc80 -- only companies with profits at or above the 80th percentile
order by profits desc
limit 10 -- for demonstration purposes, limit the results to 10 companies; this limit is not used below
 * postgresql://fede:***@localhost/datacamp
10 rows affected.
| title | sector | profits | perc80 | ratio |
|-------|--------|---------|--------|-------|
| Apple | Technology | 45687 | 7266 | 629 |
| J.P. Morgan Chase | Financials | 24733 | 3014 | 821 |
| Berkshire Hathaway | Financials | 24074 | 3014 | 799 |
| Wells Fargo | Financials | 21938 | 3014 | 728 |
| Alphabet | Technology | 19478 | 7266 | 268 |
| Bank of America Corp. | Financials | 17906 | 3014 | 594 |
| Microsoft | Technology | 16798 | 7266 | 231 |
| Johnson & Johnson | Health Care | 16540 | 4965 | 333 |
| Citigroup | Financials | 14912 | 3014 | 495 |
| Altria Group | Food, Beverages & Tobacco | 14239 | 6073 | 234 |

We can see that companies with the largest profits are:

  1. Apple
  2. JP Morgan
  3. Berkshire Hathaway
  4. Wells Fargo
  5. Alphabet (Google)

We can also see that most companies in the top 10 are from Tech and Financials sectors.

# Save the same SQL command as a string for later execution.
# Note that I removed the `limit`, to get all results and not only the top 10.
query_string: str = """
select title, f.sector, profits, perc80, round(f.profits / p.perc80 * 100, 0) ratio
from fortune500 f
left join percentiles p
on f.sector = p.sector
where f.profits >= p.perc80
order by profits desc
"""
# get the query results as a sql result set
sql_result = %sql $query_string
df = sql_result.DataFrame()   # convert it to a Pandas dataframe
df.head(5) # sanity check
 * postgresql://fede:***@localhost/datacamp
111 rows affected.

title sector profits perc80 ratio
0 Apple Technology 45687 7266 629
1 J.P. Morgan Chase Financials 24733 3014 821
2 Berkshire Hathaway Financials 24074 3014 799
3 Wells Fargo Financials 21938 3014 728
4 Alphabet Technology 19478 7266 268
# convert columns to integer type
df[["profits", "perc80", "ratio"]] = df[["profits", "perc80", "ratio"]].apply(
    pd.to_numeric
)

Visualization

Now I’ll create a visualization of the companies that have profits above the 80th percentile of their sector.

# Create a list of sectors; this will be used later for annotating the plot.
sectors = df["sector"].unique()  # numpy array of sectors from the dataframe
sectors: List[str] = list(sectors)  # convert the numpy array to a python list
print(sectors)  # sanity check
['Technology', 'Financials', 'Health Care', 'Food, Beverages & Tobacco', 'Retailing', 'Telecommunications', 'Household Products', 'Motor Vehicles & Parts', 'Media', 'Industrials', 'Energy', 'Business Services', 'Aerospace & Defense', 'Hotels, Restaurants & Leisure', 'Transportation', 'Chemicals', 'Food & Drug Stores', 'Apparel', 'Wholesalers', 'Materials', 'Engineering & Construction']

Now, I’ll use the matplotlib artist layer to create a horizontal scatter plot.

plt.figure(figsize=(5,6))
ax = sns.scatterplot(data=df, x='profits', y='sector', s=300, hue='profits')  # one point per company, colored by profits

png

fig, ax = plt.subplots(figsize=(10, 20))
plt.scatter(df.profits, df.sector, c="blue")

# annotate each sector's most profitable company next to its point
# (df is sorted by profits descending, so .iloc[0] is the sector's top company)
for s in sectors:
    top_row = df[df["sector"] == s].iloc[0]
    plt.text(int(top_row["profits"]), sectors.index(s), top_row["title"], rotation=45)

plt.show()

png

# close the SQL connection or stop the database server
# WARNING: Double check if used in production
command_restart = "/etc/init.d/postgresql restart"  # to close all the connections and restart the server
command_close = "/etc/init.d/postgresql stop"  # to stop the server
!echo {password}|sudo -S {command_close}
[sudo] password for fede: Stopping postgresql (via systemctl): postgresql.service.

Moving On

In a later notebook, I perform this same analysis using Python Pandas. Feel free to check it out:

Analysis of Fortune500 companies profits with Pandas
