In this short guide, I’ll show you how to create a Correlation Matrix using Pandas. I’ll also review the steps to display the matrix using Seaborn and Matplotlib.
To start, here is a template that you can apply in order to create a correlation matrix using pandas:
df.corr()
Next, I’ll show you an example with the steps to create a correlation matrix for a given dataset.
Step 1: Collect the Data
Firstly, collect the data that will be used for the correlation matrix.
For example, I collected the following data about 3 variables:
A | B | C |
45 | 38 | 10 |
37 | 31 | 15 |
42 | 26 | 17 |
35 | 28 | 21 |
39 | 33 | 12 |
Step 2: Create a DataFrame using Pandas
Next, create a DataFrame in order to capture the above dataset in Python:
import pandas as pd

data = {'A': [45, 37, 42, 35, 39],
        'B': [38, 31, 26, 28, 33],
        'C': [10, 15, 17, 21, 12]
        }

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
Once you run the code, you’ll get the following DataFrame:
    A   B   C
0  45  38  10
1  37  31  15
2  42  26  17
3  35  28  21
4  39  33  12
Step 3: Create a Correlation Matrix using Pandas
Now, create a correlation matrix using this template:
df.corr()
This is the complete Python code that you can use to create the correlation matrix for our example:
import pandas as pd

data = {'A': [45, 37, 42, 35, 39],
        'B': [38, 31, 26, 28, 33],
        'C': [10, 15, 17, 21, 12]
        }

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Pairwise correlation between all columns
corrMatrix = df.corr()
print(corrMatrix)
Run the code in Python, and you’ll get the following matrix:
          A         B         C
A  1.000000  0.518457 -0.701886
B  0.518457  1.000000 -0.860941
C -0.701886 -0.860941  1.000000
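Since df.corr() returns an ordinary DataFrame, individual coefficients can be read from it by label. The snippet below is a minimal sketch of that idea; the variable names corr_ab and corr_with_c are only illustrative and are not part of the original example.

```python
import pandas as pd

data = {'A': [45, 37, 42, 35, 39],
        'B': [38, 31, 26, 28, 33],
        'C': [10, 15, 17, 21, 12]}
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
corrMatrix = df.corr()

# Look up a single coefficient by row label and column label
corr_ab = corrMatrix.loc['A', 'B']

# Correlation of every other column with 'C' (drop C's correlation with itself)
corr_with_c = corrMatrix['C'].drop('C')

print(round(corr_ab, 6))
print(corr_with_c.round(6))
```

Because the matrix is symmetric, corrMatrix.loc['A', 'B'] and corrMatrix.loc['B', 'A'] hold the same value.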
Step 4 (optional): Get a Visual Representation of the Correlation Matrix using Seaborn and Matplotlib
You can use the seaborn and matplotlib packages in order to get a visual representation of the correlation matrix.
First import the seaborn and matplotlib packages:
import seaborn as sn
import matplotlib.pyplot as plt
Then, add the following syntax at the bottom of the code:
sn.heatmap(corrMatrix, annot=True)
plt.show()
So the complete Python code would look like this:
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

data = {'A': [45, 37, 42, 35, 39],
        'B': [38, 31, 26, 28, 33],
        'C': [10, 15, 17, 21, 12]
        }

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

corrMatrix = df.corr()

# Draw the matrix as a heatmap; annot=True writes each coefficient in its cell
sn.heatmap(corrMatrix, annot=True)
plt.show()
Run the code, and you’ll get the correlation matrix displayed as a heatmap, with each cell annotated with its correlation coefficient.
That’s it! You may also want to review the following source that explains the steps to create a Confusion Matrix using Python. Alternatively, you may check this guide about creating a Covariance Matrix in Python.
A correlation matrix is a table containing correlation coefficients between variables. Each cell in the table represents the correlation between two variables. The value lies between -1 and 1. A correlation matrix is used to summarize data, as a diagnostic for advanced analyses and as an input into a more advanced analysis. The two key components of the correlation are:
- Magnitude: the larger the magnitude, the stronger the correlation.
- Sign: if positive, the variables tend to move in the same direction (a direct correlation). If negative, they tend to move in opposite directions (an inverse correlation).
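To make the two components concrete, here is a small sketch that turns a coefficient into a (strength, direction) description. The function name describe_correlation and the cut-offs 0.3 and 0.7 are assumptions chosen for illustration; there is no single agreed-upon threshold, so adjust them to your field.

```python
def describe_correlation(r, weak=0.3, strong=0.7):
    """Return (strength, direction) for a correlation coefficient r.

    The cut-offs `weak` and `strong` are illustrative assumptions,
    not values defined by NumPy or pandas.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")

    # Magnitude: how strong the relationship is, regardless of direction
    magnitude = abs(r)
    if magnitude >= strong:
        strength = "strong"
    elif magnitude >= weak:
        strength = "moderate"
    else:
        strength = "weak"

    # Sign: whether the variables move together or in opposite directions
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "inverse"
    else:
        direction = "none"
    return strength, direction


print(describe_correlation(0.95750662))   # ('strong', 'positive')
print(describe_correlation(-0.860941))    # ('strong', 'inverse')
```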
A correlation matrix can be created using either of the following two libraries:
- NumPy library
- pandas library
Method 1: Creating a correlation matrix using the NumPy library
NumPy provides the corrcoef() function, which returns a 2×2 matrix when it is given two variables x and y. The matrix consists of the correlations of x with x [0,0], x with y [0,1], y with x [1,0] and y with y [1,1]. We are only concerned with the correlation of x with y, i.e. cell [0,1] or [1,0]. See below for an example.
Example 1: Suppose an ice cream shop keeps track of total sales of ice creams versus the temperature on that day.
Python3
import numpy as np

# x: ice cream sales, y: temperature on that day
x = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]
y = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]

# 2x2 matrix of correlation coefficients between x and y
matrix = np.corrcoef(x, y)
print(matrix)
Output
[[1.         0.95750662]
 [0.95750662 1.        ]]
In the above matrix, cells [0,1] and [1,0] hold the same value, 0.95750662. This is a strong positive correlation, which leads us to conclude that sales tend to be higher when the temperature is higher.
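As a cross-check, the same coefficient can be computed by hand from the standard Pearson formula (the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations). The sketch below assumes that np.corrcoef is being used for the Pearson coefficient, which is what it computes; the names dx, dy, r_manual and r_numpy are illustrative.

```python
import numpy as np

x = np.array([215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408], dtype=float)
y = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2])

# Deviations of each observation from its variable's mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The off-diagonal cell of the matrix returned by corrcoef
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Both values should agree up to floating-point rounding.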
Example 2: Suppose we are given the glucose level in the body with respect to age. Find the correlation between age (x) and glucose level in the body (y).
Python3
import numpy as np

# x: age, y: glucose level in the body
x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]

matrix = np.corrcoef(x, y)
print(matrix)
Output
[[1.        0.5298089]
 [0.5298089 1.       ]]
In the above correlation matrix, the off-diagonal value is 0.5298089, which means that age and glucose level have a moderate positive correlation.
Method 2: Creating a correlation matrix using the pandas library
To create a correlation matrix for a given dataset, we use the corr() method on a DataFrame.
Example 1:
Python3
import pandas as pd

data = {
    'x': [45, 37, 42, 35, 39],
    'y': [38, 31, 26, 28, 33],
    'z': [10, 15, 17, 21, 12]
}

dataframe = pd.DataFrame(data, columns=['x', 'y', 'z'])
print("Dataframe is : ")
print(dataframe)

# Pairwise correlation between all columns
matrix = dataframe.corr()
print("Correlation matrix is : ")
print(matrix)
Output:
Dataframe is : 
    x   y   z
0  45  38  10
1  37  31  15
2  42  26  17
3  35  28  21
4  39  33  12
Correlation matrix is : 
          x         y         z
x  1.000000  0.518457 -0.701886
y  0.518457  1.000000 -0.860941
z -0.701886 -0.860941  1.000000
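The two methods agree with each other. As a rough sketch, the same 3×3 matrix can be obtained from NumPy by passing the DataFrame's values to np.corrcoef with rowvar=False, which tells NumPy that each column (rather than each row) is a variable. The names pandas_matrix and numpy_matrix are illustrative.

```python
import numpy as np
import pandas as pd

data = {
    'x': [45, 37, 42, 35, 39],
    'y': [38, 31, 26, 28, 33],
    'z': [10, 15, 17, 21, 12]
}
dataframe = pd.DataFrame(data, columns=['x', 'y', 'z'])

# Method 2: pandas, returns a labelled DataFrame
pandas_matrix = dataframe.corr()

# Method 1: NumPy, returns a plain array; rowvar=False treats columns as variables
numpy_matrix = np.corrcoef(dataframe.to_numpy(), rowvar=False)

print(np.allclose(pandas_matrix.to_numpy(), numpy_matrix))  # True
```

The pandas result keeps the column labels, which is usually more convenient when there are more than two variables.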
Example 2:
CSV File used:
Python3
import pandas as pd

# Load the dataset from a CSV file
dataframe = pd.read_csv("C:\\GFG\\sample.csv")
print(dataframe)

matrix = dataframe.corr()
print("Correlation Matrix is : ")
print(matrix)
Output:
Correlation Matrix is : 
                      AVG temp C  Ice Cream production
AVG temp C              1.000000              0.718032
Ice Cream production    0.718032              1.000000
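A CSV file often contains text columns as well, and correlation is only defined for numeric data. The sketch below shows one way to handle that, by keeping only the numeric columns before calling corr(). The CSV content is made up for illustration (it is not the sample.csv used above), and the Month column is a hypothetical text column; the file is read from an in-memory string so the snippet runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for this sketch
csv_text = """Month,AVG temp C,Ice Cream production
Jan,4,120
Feb,6,135
Mar,11,160
Apr,15,210
May,20,260
Jun,24,300
"""

dataframe = pd.read_csv(io.StringIO(csv_text))

# Keep only numeric columns so that the text column is left out
numeric_part = dataframe.select_dtypes(include='number')

matrix = numeric_part.corr()
print("Correlation Matrix is : ")
print(matrix)
```

Selecting the numeric columns explicitly avoids depending on how a particular pandas version treats non-numeric columns in corr().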