Previously, we talked about various private data sources. Now, let’s learn how to source public data sets. Public data is available on the internet across various platforms. Many data sets are ready for direct analysis, whereas others have to be manually extracted and converted into a format that is fit for analysis.
Let's now see how to source various types of public data from the internet.
Data Sourcing Anecdote: Agricultural Commodity Prices
Anand was presented with a case by his client, in which he was tasked with forecasting commodity prices. Commodity prices, as the name suggests, are the prices of commodities such as oil, gold, silver, wheat, cotton and coffee that are traded in global markets. Just as an increase in commodity prices is indicative of economic growth, a fall in them is indicative of a slowdown.
Let's see how Anand sourced the requisite data for analysis.
Optional Exercise: Sourcing Sports Data
If you are interested in sports, you can take a look at the Awesome Public Datasets list on GitHub, which contains a directory of sports data from tennis, cricket, football, basketball and other sports. For example, you can find the ball-by-ball data of all the IPL seasons (~600 matches). You may find it interesting to browse through it, do some EDA/modelling as a side project and add it to your resume. A project on the IPL could serve as an interesting talking point in any job interview.
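To give a feel for what a first EDA step on ball-by-ball data might look like, here is a minimal sketch in Python. The column names (`match_id`, `over`, `ball`, `batter`, `runs`), the function name `batter_summary` and the sample rows are all made up for illustration; a real IPL file will likely use a different layout. The strike rate is also simplified, since every row is counted as a ball faced.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in a hypothetical ball-by-ball layout; a real data set
# will likely use different column names and carry many more columns.
SAMPLE_CSV = """match_id,over,ball,batter,runs
1,1,1,Batter A,4
1,1,2,Batter A,0
1,1,3,Batter B,1
1,1,4,Batter A,6
1,1,5,Batter B,2
"""


def batter_summary(csv_file):
    """Return {batter: (total_runs, balls_faced, strike_rate)}.

    Simplification: every row counts as one ball faced, so extras such
    as wides are not treated separately.
    """
    runs = defaultdict(int)
    balls = defaultdict(int)
    for row in csv.DictReader(csv_file):
        runs[row["batter"]] += int(row["runs"])
        balls[row["batter"]] += 1
    return {
        name: (runs[name], balls[name], round(100 * runs[name] / balls[name], 1))
        for name in runs
    }


summary = batter_summary(io.StringIO(SAMPLE_CSV))
for name, (total, faced, rate) in sorted(summary.items()):
    print(f"{name}: {total} runs off {faced} balls (strike rate {rate})")
```

With a downloaded file, you could pass `open("your_file.csv", newline="")` in place of the in-memory sample, after adjusting the column names to match the actual data.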