Correation in SQL
The correlation is a statistical measure that shows how closely related are two sets of values.
The most used correlation measure used is the Pearson’s Correlation Coefficient. The Pearson’s Correlation Coefficient gives:
- a value of 1 for a perfect correation.
- a value of 0 for no correlation.
- a value of -1 for a perfect negative correlation
Examples of Positive and Negative Correlation:
- Positive Correlation: marketing expenditure and sales, price of a competitor product and sales.
- Negative Correlation: price of a product and sales of that product, price of a complementary product and sales.
Correlation vs. Causation
A positive correlation between two variables doesn’t nececssarily means that there is a causal relationship between those variables. For instance, a strong positive or negative correlation between two variables could be caused by randomness.
Additionally, while there might be a causal relationship between two variables and a strong correlation between them, the correlation coefficient doesn’t tell us anything in relation to the directionality of the causal relationship.
Correlation in SQL
Calculating the correlation coefficient in SQL is pretty straightforward. The corr(Y, X)
function returns the correlation coefficient between a set of variables.
Support
The corr
function is supported by:
- PostgreSQL
- Oracle
It is not directly supported by:
- MySQL
- MS-SQL
- Sqlite
import getpass # for password input
# magic command to load the ipython-sql extension. We can connect to any
# database which is supported by SQLAlchemy.
%load_ext sql
# create a connection
postgresql_pass = getpass.getpass()
%sql postgresql://fede:{postgresql_pass}@localhost/datacomy
print(r'DON\'T FORGET TO CLOSE THE CONNECTION')
········
DON\'T FORGET TO CLOSE THE CONNECTION
%%sql
SELECT * FROM "auto"
limit 5
* postgresql://fede:***@localhost/datacomy
5 rows affected.
id | Unnamed: 0 | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
3 | 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
4 | 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |
Examples on how to use the corr function
%%sql
SELECT CORR (DISPLACEMENT , MPG )
from AUTO A
* postgresql://fede:***@localhost/datacomy
1 rows affected.
corr |
---|
-0.8042028248059 |
%%sql
SELECT CORR (DISPLACEMENT , WEIGHT )
from AUTO A
* postgresql://fede:***@localhost/datacomy
1 rows affected.
corr |
---|
0.932824146841634 |
%%sql
SELECT CORR (MPG , WEIGHT )
from AUTO A
* postgresql://fede:***@localhost/datacomy
1 rows affected.
corr |
---|
-0.831740933244338 |
%%sql
SELECT CORR (MPG , ACCELERATION )
from AUTO A
* postgresql://fede:***@localhost/datacomy
1 rows affected.
corr |
---|
0.420288912101655 |
More about the correlation SQL function
The correlation function is an aggregate function, it computes a single result from a set of input values.