Correation in SQL

The correlation is a statistical measure that shows how closely related are two sets of values.

The most used correlation measure used is the Pearson’s Correlation Coefficient. The Pearson’s Correlation Coefficient gives:

  • a value of 1 for a perfect correation.
  • a value of 0 for no correlation.
  • a value of -1 for a perfect negative correlation

Examples of Positive and Negative Correlation:

  • Positive Correlation: marketing expenditure and sales, price of a competitor product and sales.
  • Negative Correlation: price of a product and sales of that product, price of a complementary product and sales.

Correlation vs. Causation

A positive correlation between two variables doesn’t nececssarily means that there is a causal relationship between those variables. For instance, a strong positive or negative correlation between two variables could be caused by randomness.

Additionally, while there might be a causal relationship between two variables and a strong correlation between them, the correlation coefficient doesn’t tell us anything in relation to the directionality of the causal relationship.

Correlation in SQL

Calculating the correlation coefficient in SQL is pretty straightforward. The corr(Y, X) function returns the correlation coefficient between a set of variables.

Support

The corr function is supported by:

  • PostgreSQL
  • Oracle

It is not directly supported by:

  • MySQL
  • MS-SQL
  • Sqlite
import getpass  # for password input
# magic command to load the ipython-sql extension. We can connect to any 
# database which is supported by SQLAlchemy. 
%load_ext sql
# create a connection
postgresql_pass = getpass.getpass()
%sql postgresql://fede:{postgresql_pass}@localhost/datacomy
print(r'DON\'T FORGET TO CLOSE THE CONNECTION')
 ········


DON\'T FORGET TO CLOSE THE CONNECTION
%%sql
SELECT * FROM  "auto" 
limit 5
 * postgresql://fede:***@localhost/datacomy
5 rows affected.
id Unnamed: 0 mpg cylinders displacement horsepower weight acceleration year origin name
0 0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino

Examples on how to use the corr function

%%sql
SELECT CORR (DISPLACEMENT , MPG )
from AUTO A 
 * postgresql://fede:***@localhost/datacomy
1 rows affected.
corr
-0.8042028248059
%%sql
SELECT CORR (DISPLACEMENT , WEIGHT )
from AUTO A 
 * postgresql://fede:***@localhost/datacomy
1 rows affected.
corr
0.932824146841634
%%sql
SELECT CORR (MPG , WEIGHT )
from AUTO A 
 * postgresql://fede:***@localhost/datacomy
1 rows affected.
corr
-0.831740933244338
%%sql
SELECT CORR (MPG , ACCELERATION )
from AUTO A 
 * postgresql://fede:***@localhost/datacomy
1 rows affected.
corr
0.420288912101655

More about the correlation SQL function

The correlation function is an aggregate function, it computes a single result from a set of input values.

Related