SQL Temporary Tables
Definition: A temporary table is a table that will persist for a session only. When the connection is closed, the temporary table is dropped.
In this notebook, we will explore and visualize data from the Fortune 500 companies and find the best-performing companies in each sector, using PostgreSQL temporary tables.
In PostgreSQL, a temporary table is created by specifying TEMPORARY or TEMP in the CREATE TABLE AS command.
Example:
CREATE TEMP TABLE TOP_COMPANIES AS (
    SELECT * FROM FORTUNE500 WHERE PROFITS > 0 ORDER BY PROFITS DESC LIMIT 10
);
SELECT * FROM TOP_COMPANIES ORDER BY PROFITS DESC;
Result:
rank | title | name | ticker | url | hq | sector | industry | employees | revenues | revenues_change | profits | profits_change | assets | equity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Apple | Apple, Inc. | AAPL | http://www.apple.com | Cupertino, CA | Technology | Computers, Office Equipment | 116000 | 215639 | -7.7 | 45687 | -14.4 | 321686 | 128249 |
21 | J.P. Morgan Chase | JPMorgan Chase & Co. | JPM | http://www.jpmorganchase.com | New York, NY | Financials | Commercial Banks | 243355 | 105486 | 4.4 | 24733 | 1.2 | 2490972 | 254190 |
2 | Berkshire Hathaway | Berkshire Hathaway Inc. | BRKA | http://www.berkshirehathaway.com | Omaha, NE | Financials | Insurance: Property and Casualty (Stock) | 367700 | 223604 | 6.1 | 24074 | 0 | 620854 | 283001 |
25 | Wells Fargo | Wells Fargo & Company | WFC | http://www.wellsfargo.com | San Francisco, CA | Financials | Commercial Banks | 269100 | 94176 | 4.6 | 21938 | -4.2 | 1930115 | 199581 |
27 | Alphabet | Alphabet Inc. | GOOGL | http://www.abc.xyz | Mountain View, CA | Technology | Internet Services and Retailing | 72053 | 90272 | 20.4 | 19478 | 19.1 | 167497 | 139036 |
26 | Bank of America Corp. | Bank of America Corporation | BAC | http://www.bankofamerica.com | Charlotte, NC | Financials | Commercial Banks | 208024 | 93662 | 0.7 | 17906 | 12.7 | 2187702 | 266840 |
28 | Microsoft | Microsoft Corporation | MSFT | http://www.microsoft.com | Redmond, WA | Technology | Computer Software | 114000 | 85320 | -8.8 | 16798 | 37.8 | 193694 | 71997 |
35 | Johnson & Johnson | Johnson & Johnson | JNJ | http://www.jnj.com | New Brunswick, NJ | Health Care | Pharmaceuticals | 126400 | 71890 | 2.6 | 16540 | 7.3 | 141208 | 70418 |
30 | Citigroup | Citigroup Inc. | C | http://www.citigroup.com | New York, NY | Financials | Commercial Banks | 219000 | 82386 | -6.7 | 14912 | -13.5 | 1792077 | 225120 |
148 | Altria Group | Altria Group, Inc. | MO | http://www.altria.com | Richmond, VA | Food, Beverages & Tobacco | Tobacco | 8300 | 19337 | 2.6 | 14239 | 171.7 | 45932 | 12770 |
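Although the temporary table is dropped automatically when the session ends, it can also be removed explicitly, or scoped to a single transaction with an ON COMMIT clause. A minimal sketch using the table created above:

-- drop the temporary table explicitly instead of waiting for the session to end
DROP TABLE IF EXISTS TOP_COMPANIES;

-- or let PostgreSQL drop it at the end of the current transaction
BEGIN;
CREATE TEMP TABLE TOP_COMPANIES ON COMMIT DROP AS (
    SELECT * FROM FORTUNE500 WHERE PROFITS > 0 ORDER BY PROFITS DESC LIMIT 10
);
SELECT * FROM TOP_COMPANIES ORDER BY PROFITS DESC;
COMMIT; -- TOP_COMPANIES no longer exists after this point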
Temporary Tables vs. Common Table Expressions
Temporary tables are very similar to common table expressions in syntax and functionality.
From the PostgreSQL docs:
WITH provides a way to write auxiliary statements for use in a larger query. These statements, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for one query.
Example:
WITH TOP_COMPANIES_CTE AS (
    SELECT * FROM FORTUNE500 WHERE PROFITS > 0 ORDER BY PROFITS DESC LIMIT 10
)
SELECT * FROM TOP_COMPANIES_CTE;
Result:
rank | title | name | ticker | url | hq | sector | industry | employees | revenues | revenues_change | profits | profits_change | assets | equity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Apple | Apple, Inc. | AAPL | http://www.apple.com | Cupertino, CA | Technology | Computers, Office Equipment | 116000 | 215639 | -7.7 | 45687 | -14.4 | 321686 | 128249 |
21 | J.P. Morgan Chase | JPMorgan Chase & Co. | JPM | http://www.jpmorganchase.com | New York, NY | Financials | Commercial Banks | 243355 | 105486 | 4.4 | 24733 | 1.2 | 2490972 | 254190 |
2 | Berkshire Hathaway | Berkshire Hathaway Inc. | BRKA | http://www.berkshirehathaway.com | Omaha, NE | Financials | Insurance: Property and Casualty (Stock) | 367700 | 223604 | 6.1 | 24074 | 0 | 620854 | 283001 |
25 | Wells Fargo | Wells Fargo & Company | WFC | http://www.wellsfargo.com | San Francisco, CA | Financials | Commercial Banks | 269100 | 94176 | 4.6 | 21938 | -4.2 | 1930115 | 199581 |
27 | Alphabet | Alphabet Inc. | GOOGL | http://www.abc.xyz | Mountain View, CA | Technology | Internet Services and Retailing | 72053 | 90272 | 20.4 | 19478 | 19.1 | 167497 | 139036 |
26 | Bank of America Corp. | Bank of America Corporation | BAC | http://www.bankofamerica.com | Charlotte, NC | Financials | Commercial Banks | 208024 | 93662 | 0.7 | 17906 | 12.7 | 2187702 | 266840 |
28 | Microsoft | Microsoft Corporation | MSFT | http://www.microsoft.com | Redmond, WA | Technology | Computer Software | 114000 | 85320 | -8.8 | 16798 | 37.8 | 193694 | 71997 |
35 | Johnson & Johnson | Johnson & Johnson | JNJ | http://www.jnj.com | New Brunswick, NJ | Health Care | Pharmaceuticals | 126400 | 71890 | 2.6 | 16540 | 7.3 | 141208 | 70418 |
30 | Citigroup | Citigroup Inc. | C | http://www.citigroup.com | New York, NY | Financials | Commercial Banks | 219000 | 82386 | -6.7 | 14912 | -13.5 | 1792077 | 225120 |
148 | Altria Group | Altria Group, Inc. | MO | http://www.altria.com | Richmond, VA | Food, Beverages & Tobacco | Tobacco | 8300 | 19337 | 2.6 | 14239 | 171.7 | 45932 | 12770 |
As we can see, the same result is achieved using a CTE (WITH) instead of a TEMP table.
Differences between temporary tables and common table expressions:
- WITH statements are scoped to a single query, whereas temporary tables persist for the whole session.
- WITH can be used for other things, like recursive query evaluation and data-modifying statements.
- Temporary tables allow adding indexes.
- Temporary tables are slightly less efficient.
This means that we should usually choose a CTE over a temporary table, unless we will benefit from an index, as shown in the sketch below.
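For example, if a temporary table will be filtered or joined on a column repeatedly during the session, an index can be added to it just as for a regular table. A minimal sketch using the TOP_COMPANIES table created above (the index name is arbitrary):

CREATE INDEX IDX_TOP_COMPANIES_SECTOR ON TOP_COMPANIES (SECTOR);
ANALYZE TOP_COMPANIES; -- refresh planner statistics for the temporary table

-- later queries in the same session can now use the index
SELECT * FROM TOP_COMPANIES WHERE SECTOR = 'Technology';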
Temporary Tables vs Common Table Expressions vs Subqueries
Another alternative is to use subqueries, as in the following example:
SELECT *
FROM (
    SELECT * FROM FORTUNE500 WHERE PROFITS > 0 ORDER BY PROFITS DESC LIMIT 10
) AS TOP_COMPANIES_SUBQUERY;
The result is the same, and in many simple cases the two approaches are very similar. Since we should prefer a temporary table over a CTE only when we will benefit from an index, we will focus on the differences between CTEs and subqueries:
CTE vs Subqueries:
- If the same subquery will be used multiple times, it is preferable to use a CTE to take advantage of encapsulation: we declare the subquery once and reference it whenever it is needed. This does not mean that the CTE will be executed only once; in many cases, even though the CTE is defined once, the query optimizer may execute it several times.
- A CTE can be used recursively; a subquery cannot (see the sketch below).
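As an illustration of the second point, here is a minimal recursive CTE that generates the numbers 1 through 10; there is no equivalent formulation with a plain subquery:

WITH RECURSIVE NUMBERS(N) AS (
    SELECT 1
    UNION ALL
    SELECT N + 1 FROM NUMBERS WHERE N < 10
)
SELECT N FROM NUMBERS;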
CTEs in PostgreSQL
Unlike other DBMSs, like MS SQL or Oracle, PostgreSQL evaluates the query inside the CTE and stores the result (see https://medium.com/@hakibenita/be-careful-with-cte-in-postgresql-fca5e24d2119). Because the CTE is materialized, some CTEs in PostgreSQL can incur a performance hit.
From the docs:
A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work. Another possible application is to prevent unwanted multiple evaluations of functions with side-effects. However, the other side of this coin is that the optimizer is less able to push restrictions from the parent query down into a WITH query than an ordinary subquery. The WITH query will generally be evaluated as written, without suppression of rows that the parent query might discard afterwards. (But, as mentioned above, evaluation might stop early if the reference(s) to the query demand only a limited number of rows.)
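Note that since PostgreSQL 12, simple CTEs that are referenced only once can be inlined into the parent query, and the materialization behavior can be requested explicitly with MATERIALIZED or NOT MATERIALIZED. A minimal sketch:

WITH TOP_COMPANIES_CTE AS NOT MATERIALIZED (
    SELECT * FROM FORTUNE500 WHERE PROFITS > 0
)
SELECT *
FROM TOP_COMPANIES_CTE
WHERE SECTOR = 'Technology'; -- the filter can now be pushed down into the CTE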
Different Implementations of Temporary Tables According to RDBMSs
- In MS SQL, temporary tables are created in tempdb and are stored on disk. MS SQL also implements an alternative to temporary tables, called table variables.
- PostgreSQL uses a set of temporary schemas. Temporary tables are stored in memory and spill to disk once the temp_buffers limit set in postgresql.conf is reached (see the sketch after this list).
- Oracle uses temp files in a temporary tablespace.
- MySQL stores temporary tables in memory; many temporary tables can eat up a lot of memory and cause other performance problems.
- DB2 uses a temporary tablespace. It may use memory and spill to disk if the data is too large.
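For PostgreSQL, the temp_buffers limit mentioned above can be inspected and adjusted per session, as sketched below; note that the setting can only be changed before the first use of temporary tables within a session:

-- show the current per-session memory budget for temporary tables
SHOW temp_buffers;
-- raise it for this session; must run before any temporary table is accessed
SET temp_buffers = '64MB';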
Objective
In this notebook, I’ll show how to:
- Use SQL temporary tables.
- Use the ipython-sql extension to run SQL commands from a Jupyter notebook and get data from a database.
- Store the output of SQL commands in Pandas dataframes.
- Create a visualization of the data with Seaborn.
# import required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import getpass # for password input
from typing import List # optional type annotations
# magic command to load the ipython-sql extension. We can connect to any database which is supported by SQLAlchemy.
%load_ext sql
# input sudo password
password = getpass.getpass()
········
# start the local postgres server
command = "/etc/init.d/postgresql start" # shell command to start the server
!echo {password}|sudo -S {command} # run the command with sudo; -S reads the password from stdin
[sudo] password for fede: Starting postgresql (via systemctl): postgresql.service.
If you see Starting postgresql (via systemctl): postgresql.service., PostgreSQL has been successfully started.
# create a connection
postgresql_pass = getpass.getpass()
%sql postgresql://fede:{postgresql_pass}@localhost/datacamp
print("DON'T FORGET TO CLOSE THE CONNECTION")
········
DON'T FORGET TO CLOSE THE CONNECTION
If you see Connected: username@databasename, you successfully created a connection to the database. Don't forget to close the connection when you finish using it.
Let’s print the first rows of the table fortune500. This table contains information about Fortune 500 companies.
%%sql
select * from fortune500
limit 5
* postgresql://fede:***@localhost/datacamp
5 rows affected.
rank | title | name | ticker | url | hq | sector | industry | employees | revenues | revenues_change | profits | profits_change | assets | equity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Walmart | Wal-Mart Stores, Inc. | WMT | http://www.walmart.com | Bentonville, AR | Retailing | General Merchandisers | 2300000 | 485873 | 0.8 | 13643 | -7.2 | 198825 | 77798 |
2 | Berkshire Hathaway | Berkshire Hathaway Inc. | BRKA | http://www.berkshirehathaway.com | Omaha, NE | Financials | Insurance: Property and Casualty (Stock) | 367700 | 223604 | 6.1 | 24074 | 0.0 | 620854 | 283001 |
3 | Apple | Apple, Inc. | AAPL | http://www.apple.com | Cupertino, CA | Technology | Computers, Office Equipment | 116000 | 215639 | -7.7 | 45687 | -14.4 | 321686 | 128249 |
4 | Exxon Mobil | Exxon Mobil Corporation | XOM | http://www.exxonmobil.com | Irving, TX | Energy | Petroleum Refining | 72700 | 205004 | -16.7 | 7840 | -51.5 | 330314 | 167325 |
5 | McKesson | McKesson Corporation | MCK | http://www.mckesson.com | San Francisco, CA | Wholesalers | Wholesalers: Health Care | 68000 | 192487 | 6.2 | 2258 | 53.0 | 56563 | 8924 |
Besides showing how to use SQL temporary tables, the purpose of this notebook is to find the companies whose profits are larger than the 80th percentile of their sector.
The first step is to create a temporary table that will hold the 80th percentile per sector.
Percentiles
The SQL function used to get the percentile is percentile_disc (‘disc’ stands for discrete). Another option would be percentile_cont, which finds the percentile assuming a continuous distribution of values; percentile_disc returns an actual value from the column.
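To see the difference, both functions can be computed side by side; a quick sketch against the same table:

select sector,
       percentile_disc(0.8) within group (order by profits) as perc80_disc, -- an actual value from the column
       percentile_cont(0.8) within group (order by profits) as perc80_cont  -- possibly an interpolated value
from fortune500
group by sector;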
%%sql
create temporary table percentiles as
select sector, percentile_disc(0.8) within group(order by profits) perc80
from fortune500
group by sector;
* postgresql://fede:***@localhost/datacamp
21 rows affected.
[]
%%sql
-- check the 80th percentile per sector
select *
from percentiles
order by perc80 desc;
* postgresql://fede:***@localhost/datacamp
21 rows affected.
sector | perc80 |
---|---|
Telecommunications | 8695 |
Technology | 7266 |
Food, Beverages & Tobacco | 6073 |
Health Care | 4965 |
Aerospace & Defense | 4895 |
Motor Vehicles & Parts | 4596 |
Industrials | 3695 |
Financials | 3014 |
Media | 2755 |
Transportation | 2676 |
Household Products | 2166 |
Food & Drug Stores | 2025.7 |
Hotels, Restaurants & Leisure | 1670 |
Chemicals | 1500 |
Business Services | 1401 |
Energy | 1311 |
Retailing | 1228 |
Apparel | 1074.1 |
Wholesalers | 605.9 |
Engineering & Construction | 602.7 |
Materials | 496 |
# store the values of this SQL temporary table in an ipython-sql sql.run.ResultSet
# I'll convert this result set to a pandas dataframe to visualize it.
perc80 = %sql select * from percentiles
* postgresql://fede:***@localhost/datacamp
21 rows affected.
# convert the sql result set to a dataframe
df_perc80: pd.DataFrame = perc80.DataFrame()
# sanity check
df_perc80.head()
sector | perc80 | |
---|---|---|
0 | Aerospace & Defense | 4895 |
1 | Apparel | 1074.1 |
2 | Business Services | 1401 |
3 | Chemicals | 1500 |
4 | Energy | 1311 |
# I sort the values in descending order, to create the visualization
df_perc80.sort_values(by='perc80', ascending=False, inplace=True)
Now, I’ll create a visualization using Seaborn. I’ll use a barplot.
# create a barplot
sns.set(style="darkgrid")
sns.set_palette("Blues")
plt.figure(figsize=(15,10))
ax = sns.barplot(data=df_perc80, y='sector', x='perc80', palette="Blues_r") # y is categorical, so this is a horizontal barplot; swap x and y for a vertical one
ax.set_xlabel('80th percentile of profits', fontsize=10)
ax.set_title('80th Percentile per Sector',fontsize= 30) # title of plot
Text(0.5, 1.0, '80th Percentile per Sector')
We can see that there is a big difference between the 80th percentiles across sectors. Telecommunications, Technology and Food, Beverages & Tobacco companies have large 80th percentiles of profits.
Companies in the Wholesalers, Engineering & Construction and Materials sectors have a much lower 80th percentile of profits.
Companies with large profits, per sector
Now, I want to find the companies that have profits above the 80th percentile of their sector. Afterward, I’ll create a visualization of this data.
The first step will be to create a list of companies with profits above the 80th percentile of the corresponding sector. I’ll need a sector column.
Also, I’ll join this data with the previously created temporary table, to get a column with the 80th percentile of profits (column perc80).
I’ll also create a ratio column to compare the profits of each company with the 80th percentile. Of course, since I’m selecting only companies with profits at or above the 80th percentile, this ratio will always be at least 100.
%%sql
-- now create a list of companies above 80th percentile, per sector
select title, f.sector, profits, perc80, round(f.profits / p.perc80 * 100,0) ratio
from fortune500 f
left join percentiles p -- join with the previously created temporary table
on f.sector = p.sector
where f.profits >= p.perc80 -- only companies with profits above the 80th percentile
order by profits desc
limit 10 -- for demonstration purposes, limit the results to 10 companies; this limit is removed below
* postgresql://fede:***@localhost/datacamp
10 rows affected.
title | sector | profits | perc80 | ratio |
---|---|---|---|---|
Apple | Technology | 45687 | 7266 | 629 |
J.P. Morgan Chase | Financials | 24733 | 3014 | 821 |
Berkshire Hathaway | Financials | 24074 | 3014 | 799 |
Wells Fargo | Financials | 21938 | 3014 | 728 |
Alphabet | Technology | 19478 | 7266 | 268 |
Bank of America Corp. | Financials | 17906 | 3014 | 594 |
Microsoft | Technology | 16798 | 7266 | 231 |
Johnson & Johnson | Health Care | 16540 | 4965 | 333 |
Citigroup | Financials | 14912 | 3014 | 495 |
Altria Group | Food, Beverages & Tobacco | 14239 | 6073 | 234 |
We can see that companies with the largest profits are:
- Apple
- JP Morgan
- Berkshire Hathaway
- Wells Fargo
- Alphabet (Google)
We can also see that most companies in the top 10 are from Tech and Financials sectors.
# save the same SQL query as a string for later execution; note that the limit is removed to get all results, not only the top 10
query_string: str = """
select title, f.sector, profits, perc80, round(f.profits / p.perc80 * 100, 0) ratio
from fortune500 f
left join percentiles p
on f.sector = p.sector
where f.profits >= p.perc80
order by profits desc
"""
# get the query results as a sql result set
sql_result = %sql $query_string
df = sql_result.DataFrame() # convert it to a Pandas dataframe
df.head(5) # sanity check
* postgresql://fede:***@localhost/datacamp
111 rows affected.
title | sector | profits | perc80 | ratio | |
---|---|---|---|---|---|
0 | Apple | Technology | 45687 | 7266 | 629 |
1 | J.P. Morgan Chase | Financials | 24733 | 3014 | 821 |
2 | Berkshire Hathaway | Financials | 24074 | 3014 | 799 |
3 | Wells Fargo | Financials | 21938 | 3014 | 728 |
4 | Alphabet | Technology | 19478 | 7266 | 268 |
# convert columns to integer type
df[["profits", "perc80", "ratio"]] = df[["profits", "perc80", "ratio"]].apply(
pd.to_numeric
)
Visualization
Now I’ll create a visualization of the companies that have profits above the 80th percentile of their sector.
# create a list of sectors; this will be used later for annotating the plot
sectors_arr: np.ndarray = df["sector"].unique() # get the unique sectors from the dataframe
sectors: List[str] = list(sectors_arr) # convert the numpy array to a Python list
print(sectors) # sanity check
['Technology', 'Financials', 'Health Care', 'Food, Beverages & Tobacco', 'Retailing', 'Telecommunications', 'Household Products', 'Motor Vehicles & Parts', 'Media', 'Industrials', 'Energy', 'Business Services', 'Aerospace & Defense', 'Hotels, Restaurants & Leisure', 'Transportation', 'Chemicals', 'Food & Drug Stores', 'Apparel', 'Wholesalers', 'Materials', 'Engineering & Construction']
Now, I’ll use the matplotlib Artist layer to create a horizontal scatter plot, with one annotated point per sector.
# horizontal scatter plot: profits on x, sector on y, one point per company
fig, ax = plt.subplots(figsize=(10, 20))
ax.scatter(df.profits, df.sector, c="blue")

# annotate the most profitable company of each sector
# (rows in df are already sorted by profits in descending order)
for s in sectors:
    top = df[df["sector"] == s].iloc[0]
    plt.text(int(top["profits"]), sectors.index(s), top["title"], rotation=45) # plt.annotate doesn't support rotation, so use plt.text

plt.show()
# close the SQL connection or stop the database server
# WARNING: double-check before using in production
command_restart = "/etc/init.d/postgresql restart" # to close all the connections and restart the server
command_close = "/etc/init.d/postgresql stop" # to stop the server
!echo {password}|sudo -S {command_close}
[sudo] password for fede: Stopping postgresql (via systemctl): postgresql.service.
Moving On
In a later notebook, I perform this same analysis using Python and Pandas. Feel free to check it out: