Least Squares Regression Line and How to Calculate it from your Data.
“Systems of linear equations are considered overdetermined when more equations exist for the problem than unknowns.”
Seen from a different perspective, calculating a least squares regression line allows you to find the equation that best fits a problem when no single exact solution exists.
The least squares solution is the one that minimizes the sum of the squared deviations across every equation concerned. This reduces the overall error and keeps the residual values affecting your results as small as possible. Least squares regression lines are often used for data fitting applications where more accuracy is needed. Fitting your data this way gives you a more accurate line of best fit and lets you come to more accurate conclusions.
In statistics and mathematics, linear least squares is exceptionally helpful because it lets you fit a model to your data even when you have more data points than unknown parameters. The resulting model can then be used to summarize the data, to predict unobserved values from the same system, and to understand and discover new mechanisms that the system acts on. Statistics is difficult to get your head around, yet it is so much easier when you have some guidance. Taking a course in Statistics might help you use least squares regression lines with greater confidence.
Linear least squares only applies to models that are linear in their parameters, as the name suggests. This contrasts with non-linear least squares techniques, which generally have no direct formula and must be solved iteratively by refining an initial guess.
In statistics, linear least squares underpins a very useful type of model called linear regression, an analysis that is built on least squares regression lines.
The simplest variant of linear least squares is the ordinary least squares model.
Linear least squares regression is the de facto method for finding the line of best fit that summarizes the relationship between two variables, with Y predicted from an explanatory variable X. The formula looks like this and is easy to use with a little practice and thought.
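Y = a + bX
b = r (SD_Y / SD_X)
a = ȳ - b x̄
Here r is the correlation coefficient between X and Y, SD_Y and SD_X are their standard deviations, and ȳ and x̄ are their means.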
This formula is used to draw a straight line on a graph, with X as the explanatory variable and Y as the dependent variable. The line has a slope equal to b, and a is the intercept of the line.
The intercept is the value of Y when X = 0.
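As a rough illustration of the formula above, here is a minimal Python sketch that computes a and b from raw data using b = r SD_Y / SD_X. The function name `fit_line` and the height and weight figures are made up for this example, and sample standard deviations (dividing by n - 1) are assumed.

```python
import math

def fit_line(xs, ys):
    """Return (a, b) for Y = a + bX using b = r * SD_Y / SD_X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator; an assumption).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Correlation coefficient r between X and Y.
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (
        (n - 1) * sd_x * sd_y
    )
    b = r * sd_y / sd_x          # slope
    a = mean_y - b * mean_x      # intercept
    return a, b

# Hypothetical data: customer heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 75, 86]
a, b = fit_line(heights, weights)
print(f"weight = {a:.2f} + {b:.3f} * height")
```

Spreadsheet software will do the same arithmetic for you; the sketch is only meant to show what happens behind the scenes.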
For help with graphing, we recommend that you use computer software like Microsoft Excel. For help with advanced statistics using this software, take a look at Statistics in Microsoft Excel.
As mentioned above, linear regression is often used to model the relationship between two variables. This means you can fit a linear equation to your collected data. Variable X is the explanatory variable, and the other variable depends on it. As a demonstration, imagine a tailor who wants to figure out the relationship between the weights of their customers and the customers’ heights. The tailor could do this using a linear regression model, so that in the future they can estimate one measurement from the other and save time measuring from the ground up.
When you use least squares regression lines to find a linear model for your collected data, you should first determine whether a relationship between your variables of interest is plausible. Wildly guessing at this point will lead to highly inaccurate answers, so don’t go mad and start relating height to eye colour. Look for variables that have a meaningful relationship, whilst staying mindful of whether it is the relationship you think it is. For example, higher early exam scores do not necessarily mean better university grades in the future, although there probably is some relationship. This would be a good pair of variables to think about when trying to use least squares regression lines.
Statistical analysis can be universally useful and studying some Descriptive Statistics is a great addition to your CV.
To go back to our tailor example, height and shoe size could be correlated, with shoe size being the explanatory variable.
Some people use Scatter Graphs to determine what type of relationship exists between two sets of data and how strong it is. This is helpful as a Scatter Graph can expose relationships in a simple and visual manner, whereas relationships can be difficult to pick out of tabulated data. Generally, if you see very little or no correlation on a Scatter Graph, it’s safe to assume that plotting a least squares regression line will be a waste of time. The Scatter Graph should show values that increase or decrease together. If there is no pattern to build the line from, the least squares regression line will not be a worthy use of your time.
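After creating your Scatter Graph, make sure you calculate the correlation coefficient for your data. This handy number shows how strongly, and in which direction, the two variables are linearly related. Its value always lies between -1 and +1: values near -1 or +1 indicate a strong relationship, and values near 0 indicate little or no linear relationship.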
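To make the correlation coefficient concrete, here is a small Python sketch that computes it for three tiny, made-up data sets. The function name `pearson_r` is chosen for this example only.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data sets.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # Y rises steadily with X
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # Y falls steadily as X rises
r_none = pearson_r([1, 2, 3, 4], [3, 1, 4, 2])  # no linear pattern
print(r_up, r_down, r_none)
```

The first two data sets sit exactly on a straight line, so they land at the two ends of the range; the third has no linear trend, so its value sits at zero.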
A least squares regression line is the line of best fit that minimizes the sum of the squared vertical distances between each data point and the line. Because the distances are squared, the further away from the line a data point is, the more pull it has on the line. It also means that a data point sitting exactly on the best fit line has a deviation of 0. Squaring ensures that negative deviations cannot cancel out positive ones, so points above and below the line both count towards the total error.
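The idea of minimizing squared deviations can be checked with a short Python sketch. It fits a line to some made-up points and then shows that nudging the slope or the intercept away from the fitted values only increases the total squared error. The function names are hypothetical, and the slope is computed from sums of deviations, which gives the same result as r times the ratio of standard deviations.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) minimizing the sum of squared vertical deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Total of the squared vertical distances from each point to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_fit(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)
worse_slope = sum_squared_residuals(xs, ys, a, b + 0.1)
worse_intercept = sum_squared_residuals(xs, ys, a + 0.5, b)
print(best < worse_slope, best < worse_intercept)
```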
To check your results, plot your calculated regression line against the real data dots. If your data is strongly correlated, your data points will mostly be clustered together and follow your calculated least squares regression line. Any points that lie away from the main cluster of data dots are a flag for you to look out for. These points, known as outliers or abnormalities, are a warning sign. They can indicate data that has been miscalculated, plotted incorrectly or recorded wrongly in the first place. Of course, if none of these scenarios is true, the point is a true statistical outlier and warrants further investigation. Going back to our tailor example, we could find a statistical outlier where someone very tall had a very unusual weight for his or her height. This kind of data point would throw a spanner in the works, as it can pull the least squares regression line away from its correct placement, making any information you extrapolate from the graph less accurate. Sometimes it is best to remove the outliers from your graph and do the calculations again, especially if there are not many of them. If there are many outliers, you must either check your workings or recollect the data to confirm your hypothesis.
As your outliers may not be false data, there is a chance that the least squares regression line is simply a poor fit for your data. If a point lies a large distance from the others in the horizontal direction, make sure you think carefully about that data dot, as such points have a particularly strong pull on the slope. Removing any genuinely erroneous data means that the least squares regression line will fit the remaining data much better. This in turn increases the correlation value for the data observed and allows you to make more accurate observations. This impact is significant and shouldn’t be ignored.
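To see how much pull a single outlier can have, here is a Python sketch based on the tailor example. The heights and weights are invented for illustration, and the extra point (a very tall customer with a very low weight) is a deliberately extreme, hypothetical outlier.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical customers: heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 75, 86]
a_clean, b_clean = least_squares_fit(heights, weights)

# The same data plus one made-up outlier: 200 cm tall but only 50 kg.
a_out, b_out = least_squares_fit(heights + [200], weights + [50])

print(f"slope without outlier: {b_clean:.3f}")
print(f"slope with outlier:    {b_out:.3f}")
```

With these numbers, one extra point is enough to flatten the slope considerably, which is why outliers deserve a second look before you trust the line.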
After you have calculated your least squares regression line, analyse the outlying data with great scrutiny. These deviations let you examine your original claim with more accuracy. You could plot the deviations from the line on one axis of a separate graph against your other results to see if there is any correlation. Occasionally, this will bring to your attention more variables that may be important to your original hypothesis and lead to further investigation and data gathering. All of this only makes the original data more useful. Using a least squares regression line in your graphs can also point out any non-linearities in your collected data, which could also show you where your data is incorrect.
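One way to look for non-linearities is to compute the deviation of each point from the fitted line (often called the residual) and check for a pattern. The Python sketch below uses made-up data that actually follows a curve, so the residuals show a clear shape instead of scattering randomly around zero. The function names are assumptions for this example.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(xs, ys, a, b):
    """Observed Y minus the Y predicted by the line, for each point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data that follows a curve (Y = X squared), not a line.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

a, b = least_squares_fit(xs, ys)
res = residuals(xs, ys, a, b)
print(res)  # positive at both ends, negative in the middle
```

Residuals that are positive at the ends and negative in the middle (or the reverse) suggest that a straight line is the wrong shape for the data.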
Assuming you have created a least squares regression line and analysed your resulting graphs and data, you might find these non-linearities. If you happen to find a non-linear trend in your data, you may have lurking (or influential) variables to think about. These appear when the relationship between your variables is affected by a fluctuating or significant third-party variable like wind speed, humidity, air pressure and so on. Take a guitar for example. A guitar stays in tune by using tension and counter tension. If you tune a guitar in one room, then take it somewhere with a different temperature or humidity, the guitar will go out of tune fairly quickly. These lurking or influential variables (temperature, humidity) strongly influence the desired outcome of the guitar being in tune. Influential variables and non-linear trends like these can really throw your least squares regression line off, making it inaccurate.
If you leave influential variables out of the modeling effort, looking for non-linearities in your results will quickly show you what else you need to model to get a correct least squares regression line.
Eventually, you will have an accurate least squares regression line, which can be used to predict or extrapolate values for things you currently have no data on. Whenever you fit data to a least squares regression line, you need to be very careful and mindful about the range of data the line applies to. Applying the line to the wrong range will produce inaccurate results that are not suitable for extrapolation in the future.
These inaccurate predictions can be vastly wrong and damaging, especially if there were any bad pieces of data in the original data set before the least squares regression line was applied. Using a least squares regression line just to make data fit isn’t a good use of your time, as a best fit line forced onto unsuitable data will not predict values reliably. When the data obtained is accurate and the least squares regression line makes sense, you can then begin to extrapolate information, bearing in mind any limitations based on the original data. Back to the tailor example: if all of the tailor’s customers are adults and the tailor tries to use the regression model to figure out the average size for every customer, the results will be highly inaccurate when it comes to children and teenagers. The height/weight relationship is only valid for the age group it was fitted on, so values extrapolated from the least squares regression line should only ever be used in those circumstances.
Extrapolation is a highly helpful technique and is used in many businesses, research papers and other mathematically based work. In business it can be used to project profit margins and stock prices. In this example, the least squares regression line is only useful while the stock and company are behaving in the manner that was plugged into the modeling equation. As soon as something large happens to disrupt the data, or more influential variables appear, you must adjust your least squares regression line and graphs to take in any new evidence that you can find that seems valid.
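One simple way to respect the limits of your original data is to refuse to predict outside the range the line was fitted on. The Python sketch below is only an illustration of that habit: the function name `predict_within_range`, the line weight = -76.5 + 0.85 * height, and the 150 to 190 cm range of adult customers are all made up for this example.

```python
def predict_within_range(x, a, b, x_min, x_max):
    """Predict Y = a + bX, but only for X inside the fitted data range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"{x} is outside the fitted range [{x_min}, {x_max}]; "
            "the line has not been checked against data there"
        )
    return a + b * x

# Hypothetical line fitted on adult customers between 150 cm and 190 cm.
a, b = -76.5, 0.85
x_min, x_max = 150, 190

print(predict_within_range(170, a, b, x_min, x_max))  # inside the range

try:
    # A child's height falls outside the adult data the line was built on.
    predict_within_range(120, a, b, x_min, x_max)
except ValueError as err:
    print("Refused:", err)
```

Raising an error is a deliberately strict choice; a gentler version could return the prediction together with a warning.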
If you struggled to figure out the mathematics involved in this post, Advanced Math will help you understand least squares regression lines further.
Last Updated September 2016