Linear Regression Explained with Example
Updated on Sep 23, 2022 | 7 min read | 6.4k views
Linear regression is one of the most common algorithms for establishing relationships between the variables of a dataset. A mathematical model is a necessary tool for data scientists in performing predictive analysis. This blog will fill you in on the fundamental concept and also discuss a linear regression example.
A regression model describes the relationship between dataset variables by fitting a line to the observed data. It is a mathematical analysis that sorts out which variables have an impact and matter the most, and how certain we can be about the factors involved. The two kinds of variables are:
- Dependent variable: the outcome we want to estimate or predict.
- Independent variable: the factor we believe influences the dependent variable.
Regression models are used when the dependent variable is quantitative. It may be binary in the case of logistic regression. But in this blog, we will mainly focus on the linear regression model where both variables are quantitative.
Suppose you have data on the monthly sales and average monthly rainfall for the past three years. Let’s say that you plotted this information on a chart. The y-axis represents the number of sales (dependent variable), and the x-axis depicts the total rainfall. Each dot on the chart would show how much it rained during a particular month and the corresponding sales numbers.
If you take another glance at the data, you might notice a pattern: sales appear higher in the months when it rained more. But it would be tricky to estimate how much you would typically sell when it rains a certain amount, say 3 or 4 inches. You could gain some degree of certainty by drawing a line through the middle of all the data points on the chart.
Nowadays, Excel and statistics software like SPSS, R, or STATA can help you draw a line that best fits the data at hand. In addition, you can also output a formula explaining the slope of the line.
Consider this formula for the above example: Y = 200 + 3X. It tells you that you sold 200 units when it didn’t rain at all (i.e., when X=0). Assuming that other factors stay the same, every additional inch of rain would add an average of three units to sales. You would sell 203 units if it rains 1 inch, 206 units if it rains 2 inches, 209 units if it rains 3 inches, and so on.
Typically, the regression line formula also includes an error term (Y = 200 + 3X + error term). It accounts for the reality that independent variables are rarely perfect predictors of dependent variables; the line merely gives you an estimate based on the available data. The larger the error term, the less certain your regression line is.
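The fitted line above can be read as a simple prediction function. Here is a minimal sketch in Python, using the example's illustrative values of 200 for the intercept and 3 for the slope:

```python
# Hypothetical sales model from the example: Y = 200 + 3X (+ error).
def predicted_sales(rainfall_inches):
    """Estimate units sold for a given amount of rainfall, per the fitted line."""
    intercept = 200  # units sold with no rain (X = 0)
    slope = 3        # extra units sold per additional inch of rain
    return intercept + slope * rainfall_inches

print(predicted_sales(0))  # 200 units
print(predicted_sales(3))  # 209 units
```

The error term is deliberately left out of the function: it represents the scatter of real observations around the line, not something you compute for a single prediction.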
A simple linear regression model uses a straight line to estimate the relationship between two quantitative variables. If you have more than one independent variable, you will use multiple linear regression instead.
Simple linear regression analysis is concerned with two things. First, it tells you the strength of the relationship between the dependent and independent factors of the historical data. Second, it gives you the value of the dependent variable at a certain value of the independent variable.
Consider this linear regression example. A social researcher interested in knowing how individuals’ income affects their happiness levels performs a simple regression analysis to see if a linear relationship occurs. The researcher takes quantitative values of the dependent variable (happiness) and independent variable (income) by surveying people in a particular geographical location.
For instance, the data contains income figures and happiness levels (ranked on a scale from 1 to 10) from 500 people from the Indian state of Maharashtra. The researcher would then plot the data points and fit a regression line to know how much the respondents’ earnings influence their wellbeing.
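The researcher's fitting step can be sketched with NumPy's `polyfit`. The income and happiness figures below are invented purely for illustration; the real study would use all 500 survey responses:

```python
import numpy as np

# Illustrative stand-in for the survey data: income (in lakh INR, assumed)
# and happiness scores (1-10) for a handful of respondents.
income    = np.array([2.0, 4.0, 5.5, 7.0, 9.0, 12.0])
happiness = np.array([3.1, 4.4, 5.0, 5.9, 6.8, 8.2])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(income, happiness, 1)
print(f"happiness ~ {intercept:.2f} + {slope:.2f} * income")
```

A positive slope would indicate that, in this sample, higher earnings are associated with higher reported wellbeing.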
Linear regression analysis is based on a few assumptions about the data. These are:
- Linearity: the relationship between the independent and dependent variables is linear.
- Independence: the observations are independent of one another.
- Homoscedasticity: the variance of the errors is constant across values of the independent variable.
- Normality: the errors are approximately normally distributed.
y = c + ax is a standard equation where y is the output (that we want to estimate), x is the input variable (that we know), a is the slope of the line, and c is the constant.
Here, the output varies linearly based on the input. The slope determines how much x impacts the value of y. The constant is the value of y when x is nil.
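The slope a and constant c are not guessed; ordinary least squares gives closed-form estimates, a = Sxy/Sxx and c = ȳ − a·x̄. A minimal sketch, with data points invented so they lie exactly on y = 1 + 2x:

```python
# Fit y = c + ax by ordinary least squares using the closed-form estimates.
def fit_line(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx             # slope: how much y changes per unit of x
    c = y_mean - a * x_mean   # constant: value of y when x is nil
    return c, a

# Points lying exactly on y = 1 + 2x, so the fit recovers c = 1, a = 2.
c, a = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(c, a)  # 1.0 2.0
```

Because the points here fall exactly on a line, the fit is perfect; with real data the line passes through the middle of the scatter instead.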
Let’s understand this through another linear regression example. Imagine that you are employed at an automobile company and want to study India’s passenger vehicle market. Let’s say that the national GDP influences passenger vehicle sales. To plan better for the business, you might want to find the linear equation relating the number of vehicles sold in the country to the GDP.
For this, you would need sample data of year-wise passenger vehicle sales and GDP figures for every year. You might discover that the current year’s GDP affects the next year’s sales: in years when GDP was lower, vehicle sales fell in the subsequent year.
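Because this year's GDP is assumed to drive next year's sales, the data needs a one-year shift before fitting: each year's sales are paired with the previous year's GDP. A sketch with figures that are entirely made up for illustration:

```python
import numpy as np

# Hypothetical year-wise figures (invented for illustration).
gdp   = [2.1, 2.3, 2.2, 2.6, 2.9, 3.0]   # GDP, trillion USD (assumed)
sales = [3.0, 3.2, 3.4, 3.3, 3.8, 4.1]   # vehicles sold, millions (assumed)

# Pair each year's sales with the PREVIOUS year's GDP.
lagged_gdp = gdp[:-1]    # GDP for years 1..n-1
next_sales = sales[1:]   # sales for years 2..n

slope, intercept = np.polyfit(lagged_gdp, next_sales, 1)
print(f"sales_next ~ {intercept:.2f} + {slope:.2f} * gdp")
```

This lagging step is part of the data preparation mentioned below; the fit itself is the same simple linear regression as before.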
To prepare this data for Machine Learning analytics, you would need to do a little more work.
Performing simple linear regression in R makes interpreting and reporting the results much easier.
For the same linear regression example, let us change the equation to y = B0 + B1x + e. Again, y is the dependent variable, and x is the independent or known variable. B0 is the constant or intercept, B1 is the regression coefficient (the slope of the line), and e is the error of the estimate.
Statistical software like R can find the line of best fit through the data and search for the B1 that minimises the total error of the model.
Note: The output contains sections like ‘Call’, ‘Residuals’, and ‘Coefficients’. The ‘Call’ section states the formula used. The ‘Residuals’ section lists the minimum, first quartile, median, third quartile, and maximum of the residuals to indicate how well the model fits the real data. The first row of the ‘Coefficients’ table estimates the y-intercept, and the second row gives the regression coefficient. The columns of this table carry labels such as Estimate, Std. Error, t value, and p-value.
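The ‘Residuals’ summary that R prints can be reproduced by hand: fit the line, compute observed minus predicted, then take the five-number summary. A sketch in Python with invented data:

```python
import numpy as np

# Invented data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.9, 4.1, 4.8, 6.2, 6.7])

b1, b0 = np.polyfit(x, y, 1)      # slope (B1) and intercept (B0)
residuals = y - (b0 + b1 * x)     # e = observed - fitted

# Min, 1Q, Median, 3Q, Max, as in R's 'Residuals' section.
q = np.percentile(residuals, [0, 25, 50, 75, 100])
print("Min, 1Q, Median, 3Q, Max:", np.round(q, 3))
```

With an intercept in the model, ordinary least squares forces the residuals to sum to (numerically) zero, which is why the summary centres near zero when the model fits well.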
With the above linear regression example, we have given you an overview of generating a simple linear regression model, finding the regression coefficient, and calculating the error of the estimate. We also touched upon the relevance of Python and R for predictive data analytics and statistics. Practical knowledge of such tools is crucial for pursuing careers in data science and machine learning today.
If you want to hone your programming skills, check out the Advanced Certificate Programme in Machine Learning by IIT Madras and upGrad. The online course also includes case studies, projects, and expert mentorship sessions to bring an industry-oriented focus to the training.