In the previous segments, we learnt about data cleaning, feature engineering and big data analytics, which brings us to arguably the most important stage in data analysis: prediction. Almost all data cleaning, data preparation and analysis is done in order to predict what will happen next: what a user will like, which movie will succeed, and so on.
Please note that 'independent variable' is just another term for a feature. The dependent variable is the variable whose value is determined by the independent variables. In other words, independent variables are the 'factors' that influence your dependent variable, and you want to understand how changing an independent variable affects the value of the dependent variable.
Let's first understand three important terms used in predictive analytics: prediction, regression and time series forecasting.
In prediction, based on the previous values of the independent variables, you attempt to predict the dependent variable. Regression analysis is one such technique that helps in making predictions.
For example, you may want to predict sales using data on marketing spend, pricing, promotion and product placement. You build a model that captures the relationship between these independent variables and the dependent variable, which is sales. You then apply this model to present and future values of the independent variables to predict what future sales will be. Here, regression can help you.
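The sales example can be sketched in a few lines of Python. This is a minimal illustration, not a full model: it assumes a single independent variable (marketing spend) instead of the several listed above, fits a straight line by ordinary least squares, and uses made-up numbers. The names `fit_line` and `predict_sales` are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past data: marketing spend and the sales observed with it
marketing_spend = [10, 20, 30, 40, 50]
sales = [27, 44, 66, 84, 104]

slope, intercept = fit_line(marketing_spend, sales)


def predict_sales(spend):
    """Apply the fitted relationship to a new value of the independent variable."""
    return intercept + slope * spend


# Predict sales for a planned spend that was not in the past data
print(predict_sales(60))
```

With more than one independent variable, the same idea carries over to multiple linear regression, which is usually fitted with a library instead of by hand.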
In (time series) forecasting, based on the previous values of a variable, you attempt to predict its future values. For example, given past sales data, you want to predict future sales. You look for patterns in the sales data itself and not at the relationship between sales and the other variables.
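To contrast this with regression, here is a small sketch of a forecast that uses only the past values of sales. It assumes a simple moving average as the forecasting rule, which is one of the most basic options and not necessarily the technique used later in this course. The function name and the sales figures are made up for illustration.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical monthly sales, oldest first; no other variables are used
past_sales = [100, 110, 120, 130, 125, 135]

next_month = moving_average_forecast(past_sales, window=3)
print(next_month)
```

Note that the only input is the sales series itself, whereas a regression model would need the values of the independent variables.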
Given the huge amount of data that is available, it is humanly impossible to scan through the entire data and make predictions manually. So you train 'machines' to make predictions using this data. Let's have Ujjyaini explain the basics of machine learning in the following video.
In supervised learning, you use a training set to make the algorithm learn, and then apply what it learnt to new, unseen data points. In unsupervised learning, you try to find patterns based on similarities in the data.
Consider a dataset that relates a team's transfer spend to its final league position in the Premier League. You can build a model on this data and tease out a reasonably dependable relationship between transfer spend and final league position. You can then apply this relationship to future seasons and predict league positions. This is supervised learning, and more specifically an example of classification, as the final output is limited to twenty natural numbers only.
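As a rough sketch of the supervised setting, the snippet below learns from a labelled training set and then predicts a league position for a spend it has not seen. It assumes a 1-nearest-neighbour rule purely for simplicity; the data are invented and `predict_position` is a hypothetical name.

```python
# Hypothetical training set: (transfer spend in millions, final league position)
training_set = [
    (150, 1),
    (120, 3),
    (60, 9),
    (20, 17),
]


def predict_position(spend, examples):
    """Return the league position of the training example with the closest spend."""
    nearest = min(examples, key=lambda example: abs(example[0] - spend))
    return nearest[1]


# Apply what was learnt to new, unseen data points
print(predict_position(130, training_set))
print(predict_position(30, training_set))
```

The output is always one of the positions seen in training, which is what makes this a classification task in the sense described above.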
Now consider a dataset containing the finishing statistics of strikers for the last six years. You do not know in advance what you can extract from the data, but upon clustering, you will be able to see a few clusters. One cluster will probably have the best strikers in the world, like an Aubameyang. Another cluster might have strikers who score a lot of penalty goals; this would contain a Kane or a Hazard.
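The clustering idea can be illustrated with a small k-means sketch. This assumes k-means with two clusters and fixed starting centroids so that the result is reproducible; other clustering methods would work too. The striker statistics are made up and do not describe real players.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then update centroids."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New centroid is the coordinate-wise mean of the cluster members
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                # Keep the old centroid if no point was assigned to it
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical strikers: (open-play goals, penalty goals) per season
strikers = [(25, 2), (27, 1), (24, 3), (10, 9), (12, 10), (11, 8)]

# Assumed starting centroids: the first and the last striker
labels, centroids = kmeans(strikers, [strikers[0], strikers[-1]])
print(labels)
print(centroids)
```

No labels are supplied here: the two groups (mostly open-play scorers and frequent penalty scorers) emerge only from similarities in the data, which is what distinguishes this from the supervised example.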
You can read more about the above-mentioned topics below.