How do you plot a simple linear regression in Python?
Coding a line of best fit

Simple linear regression is a concept that you may already be familiar with from middle school or high school. If you have ever heard of a slope and an intercept, or y = mx + b, then you have already seen the equation behind simple linear regression!

What is Simple Linear Regression?

Simple linear regression is a statistical method that we can use to find a relationship between two variables and make predictions. The two variables are typically denoted y and x. The independent variable, the one used to predict the dependent variable, is denoted x. The dependent variable, or the outcome/output, is denoted y.

A simple linear regression model produces a line of best fit, also called the regression line. You may have heard about drawing the line of best fit through a scatter plot of data. For example, let's say we have a scatter plot showing how salaries vary with years of experience. Imagine drawing a line through the points to capture the trend.

The simple linear regression equation we will use is:

y = 𝜷0 + 𝜷1x

The constant is the y-intercept (𝜷0), or where the regression line crosses the y-axis. The beta coefficient (𝜷1) is the slope and describes the relationship between the independent variable and the dependent variable. The coefficient can be positive or negative and is the amount of change in the dependent variable for every 1-unit change in the independent variable. For example, let's say we have a regression equation of y = 2 + 0.5x. For every 1-unit increase in the independent variable (x), the dependent variable (y) increases by 0.5.

Simple Linear Regression Using Python

For this example, we will be using salary data from Kaggle. The data consists of two columns: years of experience and the corresponding salary. First, we will import the Python packages that we will need for this analysis.
All we will need is NumPy, to help with the math calculations; pandas, to store and manipulate the data; and Matplotlib (optional), to plot the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Next, we will load the data and assign each column to its appropriate variable. For this example, we will be using the years of experience to predict the salary, so the dependent variable will be the salary (y) and the independent variable will be the years of experience (x).

data = pd.read_csv('Salary_Data.csv')

To get a look at the data, we can print the first few rows:

print(data.head())

A scatter plot of the data shows a positive linear relationship between Years of Experience and Salary, meaning that as a person gains more experience, they also tend to get paid more.

Calculating the Regression Line

While we could spend all day guessing the slope and intercept of the linear regression line, luckily there are formulas that we can use to make these calculations quickly.

To estimate the slope 𝜷1 from the data, we will use the following formula:

𝜷1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

To estimate the intercept 𝜷0, we can use the following formula:

𝜷0 = ȳ − 𝜷1x̄

Here x̄ and ȳ are the means of x and y. Now we have to translate these two formulas into Python to calculate the regression line. First I will show the full function, then I will break it down further.

def linear_regression(x, y):

First, we store the number of observations in N. The means of x and y, named x_mean and y_mean in the code, are also needed.

N = len(x)

Now we can begin to calculate the slope 𝜷1. To keep the lines of code short, we calculate the numerator and denominator of the slope formula first, then divide the numerator by the denominator and assign the result to a variable named B1. We can just follow the slope formula given above.

B1_num = ((x - x_mean) * (y - y_mean)).sum()

Now that we have calculated the slope 𝜷1, we can use the formula for the intercept 𝜷0.
B0 = y_mean - (B1 * x_mean)

Now we can apply this function to our data.

Calculating How Well The Regression Line Fits

To determine how well our regression line fits the data, we want to calculate the correlation coefficient, commonly referred to just as R, and the coefficient of determination, otherwise known as R² (R squared).

Coefficient of Determination (R²): the proportion of variance in y explained by the independent variable (x), with values between 0 and 1. In simple linear regression it cannot be negative because it is the square of R. For example, if R² = 0.81, then x explains 81% of the variance in y. It is also known as the "goodness of fit".

Correlation Coefficient (R): the degree of linear relationship, or correlation, between two variables (x and y in this case). R can range from -1 to 1, with 1 meaning a perfect positive correlation and -1 meaning a perfect negative correlation.

Pearson's correlation coefficient can be written in its computational form as:

R = (NΣxy − ΣxΣy) / √([NΣx² − (Σx)²][NΣy² − (Σy)²])

We will have to convert this formula into Python code. Once we calculate Pearson's correlation coefficient, we can simply square it to get the coefficient of determination. We will need to store the number of observations (rows in the data) in the variable N again. Next, we will split the formula into two parts: the numerator and the denominator. We can then return the correlation coefficient.

def corr_coef(x, y):

Applying these functions to our data, we can print out the results:

B0, B1, reg_line = linear_regression(x, y)

Plotting the Regression Line

This part is completely optional and is just for fun. Using Matplotlib, we can now plot our resulting regression line with our data.

plt.figure(figsize=(12,5))

Now we can use our calculated regression line to make predictions for new data that we come across. To do this, we define a prediction function:

def predict(B0, B1, new_x):

I hope this helped you learn or review the process of simple linear regression. Multiple linear regression and polynomial regression are topics I will touch on in later articles.
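The walkthrough above can be collected into one runnable sketch. This is a reconstruction under stated assumptions, not the author's original code: the function bodies follow the slope, intercept, and Pearson formulas described in the text, the form of `reg_line` (a formatted string) is a guess, and the five data points are made up for illustration rather than taken from the Kaggle salary file.

```python
import numpy as np


def linear_regression(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = x.mean()
    y_mean = y.mean()

    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    B1_num = ((x - x_mean) * (y - y_mean)).sum()
    B1_den = ((x - x_mean) ** 2).sum()
    B1 = B1_num / B1_den

    # Intercept: makes the line pass through the point (x_mean, y_mean).
    B0 = y_mean - (B1 * x_mean)

    # Assumption: reg_line is a human-readable string of the fitted equation.
    reg_line = f"y = {B0:.3f} + {B1:.3f}x"
    return B0, B1, reg_line


def corr_coef(x, y):
    """Pearson's R using the computational (sums-based) formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    num = (N * (x * y).sum()) - (x.sum() * y.sum())
    den = np.sqrt((N * (x ** 2).sum() - x.sum() ** 2)
                  * (N * (y ** 2).sum() - y.sum() ** 2))
    return num / den


def predict(B0, B1, new_x):
    """Predicted y for a new x on the fitted line."""
    return B0 + (B1 * new_x)


# Made-up illustrative data (NOT the Kaggle salary dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

B0, B1, reg_line = linear_regression(x, y)
R = corr_coef(x, y)

print("Regression line:", reg_line)
print("Correlation coefficient (R):", R)
print("Goodness of fit (R^2):", R ** 2)
print("Prediction at x = 6:", predict(B0, B1, 6.0))
```

With the real dataset, `x` and `y` would instead come from the two columns of the loaded DataFrame; the column names depend on the CSV file and are not shown here.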
How do you graph a linear regression in Python?

How to plot a linear regression line on a scatter plot in Python:

x = np.array([1, 3, 5, 7])    # generate data
y = np.array([6, 3, 9, 5])
plt.plot(x, y, 'o')           # create scatter plot
m, b = np.polyfit(x, y, 1)    # m = slope, b = intercept
plt.plot(x, m*x + b)          # add line of best fit

What is regression plot in Python?

The regression plots in the Seaborn library for Python are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. As the name suggests, a regression plot draws a regression line between two parameters and helps to visualize their linear relationship.
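The `np.polyfit` steps listed above can be put together as a small script. This is a minimal sketch: the helper name `plot_fit` is hypothetical, and Matplotlib is imported inside it so that the numeric part still runs where plotting is not available.

```python
import numpy as np

# Same sample data as in the steps above.
x = np.array([1, 3, 5, 7])
y = np.array([6, 3, 9, 5])

# A degree-1 fit returns coefficients from highest power down:
# slope first, then intercept.
m, b = np.polyfit(x, y, 1)

# y-values of the fitted line at each observed x.
fitted = m * x + b
print("slope:", m, "intercept:", b)


def plot_fit(x, y, m, b):
    """Hypothetical helper: scatter the data and overlay the fitted line."""
    import matplotlib.pyplot as plt  # optional dependency, imported lazily

    plt.plot(x, y, "o", label="data")
    plt.plot(x, m * x + b, label="line of best fit")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()
```

Calling `plot_fit(x, y, m, b)` draws the scatter plot with the line of best fit on top.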
How do you represent a simple linear regression?

The formula for a simple linear regression is y = B0 + B1x, where:

y is the predicted value of the dependent variable (y) for any given value of the independent variable (x).
B0 is the intercept, the predicted value of y when x is 0.
B1 is the regression coefficient: how much we expect y to change as x increases by one unit.
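To make the roles of B0 and B1 concrete, here is a tiny sketch that reuses the example equation y = 2 + 0.5x from earlier in the article. The function and variable names are illustrative only.

```python
def predict(b0, b1, x):
    """Predicted y on the line y = b0 + b1 * x."""
    return b0 + b1 * x


b0, b1 = 2.0, 0.5  # intercept and slope of the example line y = 2 + 0.5x

# At x = 0 the prediction equals the intercept.
at_zero = predict(b0, b1, 0)

# Moving x up by one unit changes the prediction by exactly the slope.
step = predict(b0, b1, 11) - predict(b0, b1, 10)

print("prediction at x = 0:", at_zero)
print("change in y per unit of x:", step)
```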