Exploratory Data Analysis in Python: What You Need to Know
Updated on Jan 17, 2024 | 9 min read | 6.6k views
Exploratory Data Analysis (EDA) is a common and important practice followed by all data scientists. It is the process of examining data from different angles in order to understand it fully. A good understanding of the data helps us clean and summarise it, which then brings out insights and trends that were otherwise unclear.
Unlike ‘data analysis’, EDA has no hard set of rules to follow. People who are new to the field tend to confuse the two terms, which sound similar but differ in purpose. Data analysis leans more towards applying probability and statistical methods to reveal facts and relationships among different variables.
Coming back: there is no right or wrong way to perform EDA. It varies from person to person; however, there are some major guidelines that are commonly followed.
We will look at how some of these are implemented using the well-known ‘Home Credit Default Risk’ dataset available on Kaggle. The data contains information about each loan applicant at the time of applying for the loan. It covers two types of scenarios:
- Clients with payment difficulties: a late payment of more than X days on at least one of the first Y instalments of the loan in our sample
- All other cases: instalments paid on time
We’ll be only working on the application data files for the sake of this article.
import pandas as pd

app_data = pd.read_csv('application_data.csv')
app_data.info()
After reading the application data, we use the info() function to get a quick overview of the data we'll be dealing with. The output below tells us that we have around 300,000 loan records with 122 variables. Of these, 16 are categorical and the rest are numerical.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
It is always a good practice to handle and analyse numerical and categorical data separately.
# Categorical columns and the number of unique categories in each
categorical = app_data.select_dtypes(include=object).columns
app_data[categorical].apply(pd.Series.nunique, axis=0)
Looking only at the categorical features below, we see that most of them have just a few categories, which makes them easier to analyse using simple plots.
NAME_CONTRACT_TYPE 2
CODE_GENDER 3
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 8
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 6
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 58
FONDKAPREMONT_MODE 4
HOUSETYPE_MODE 3
WALLSMATERIAL_MODE 7
EMERGENCYSTATE_MODE 2
dtype: int64
Now for the numerical features, the describe() method gives us the statistics of our data:
numer = app_data.describe()
numerical = numer.columns
numer
Looking at the entire table, it becomes evident which features show anomalies or missing values. So now we know which features will have to be analysed further.
We can make a point plot of all the features having missing values, plotting the percentage of missing data along the Y-axis.
import matplotlib.pyplot as plt
import seaborn as sns

# Percentage of missing values in each column
missing = pd.DataFrame((app_data.isnull().sum()) * 100 / app_data.shape[0]).reset_index()
missing.columns = ['column', 'percent']

plt.figure(figsize=(16, 5))
ax = sns.pointplot(x='column', y='percent', data=missing)
plt.xticks(rotation=90, fontsize=7)
plt.title('Percentage of Missing values')
plt.ylabel('PERCENTAGE')
plt.show()
Many columns have a lot of missing data (30-70%), some have little missing data (13-19%), and many columns have no missing data at all. It is not strictly necessary to modify the dataset when you only have to perform EDA. However, when moving on to data pre-processing, we should know how to handle missing values.
For features with fewer missing values, we can use regression to predict the missing values or fill them with the mean of the values present, depending on the feature. For features with a very high number of missing values, it is better to drop those columns, as they give very little insight during analysis.
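As a minimal sketch of both strategies (the 50% drop threshold and the choice of AMT_ANNUITY as an example column are assumptions for illustration, not part of the original analysis):

# Percentage of missing values per column
missing_pct = app_data.isnull().mean() * 100

# Drop columns where more than half the values are missing (illustrative threshold)
to_drop = missing_pct[missing_pct > 50].index
app_clean = app_data.drop(columns=to_drop)

# Fill a numeric column that has few missing values with its mean
app_clean['AMT_ANNUITY'] = app_clean['AMT_ANNUITY'].fillna(app_clean['AMT_ANNUITY'].mean())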
In this dataset, loan defaulters are identified using the binary variable ‘TARGET’.
100 * app_data['TARGET'].value_counts() / len(app_data['TARGET'])
0 91.927118
1 8.072882
Name: TARGET, dtype: float64
We see that the data is highly imbalanced, with a ratio of 92:8. Most of the loans were paid back on time (target = 0). So whenever there is such a huge imbalance, it is better to take features and compare them with the target variable (targeted analysis) to determine which categories in those features tend to default on the loans more than others.
Below are just a few examples of graphs that can be made using Python's seaborn library and simple user-defined functions.
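One such helper might look like the sketch below (the function name and styling are illustrative assumptions, not the article's original code); it plots the percentage of defaulters within each category of a feature, with the overall default rate shown for reference:

def plot_default_rate(df, feature):
    # Share of defaulters (TARGET == 1) within each category, as a percentage
    rates = df.groupby(feature)['TARGET'].mean().sort_values(ascending=False) * 100
    plt.figure(figsize=(10, 4))
    sns.barplot(x=rates.index, y=rates.values)
    # Overall default rate for comparison
    plt.axhline(df['TARGET'].mean() * 100, color='red', linestyle='--', label='Overall')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('% of defaulters')
    plt.title(feature)
    plt.legend()
    plt.show()

plot_default_rate(app_data, 'CODE_GENDER')
plot_default_rate(app_data, 'NAME_EDUCATION_TYPE')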
Males (M) have a higher chance of defaulting compared to females (F), even though the number of female applicants is almost twice as high. So females are more reliable than males when it comes to paying back their loans.
Even though most applicants have secondary or higher education, it is the applicants with lower secondary education who are the riskiest for the company, followed by those with secondary education.
Several techniques are essential in exploratory data analysis in Python, as they help you understand and clean the data, identify relevant features, and test hypotheses about it. Python libraries provide a wide range of functions and methods for implementing these techniques, making Python a powerful tool for EDA.
The process of building new features from existing ones is known as feature engineering. It is a valuable stage in EDA since it enables you to extract additional information from your data. Python includes various libraries for feature engineering, such as NumPy, Pandas, and Scikit-learn.
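For example, a sketch of one derived feature using columns that exist in this dataset (the ratio itself is an illustrative choice, not a prescribed feature):

# Ratio of the credit amount to the applicant's total income
app_data['CREDIT_INCOME_RATIO'] = app_data['AMT_CREDIT'] / app_data['AMT_INCOME_TOTAL']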
Outliers are data points significantly different from the rest of the data in your dataset. Outliers can substantially impact your analysis, so they must be properly identified and handled. Outlier detection methods available in Python include the Z-score, the IQR rule, and isolation forests.
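As an example, a minimal sketch of the IQR rule applied to the income column (the column choice and the conventional 1.5 multiplier are assumptions for illustration):

# IQR-based outlier detection on the income column
q1 = app_data['AMT_INCOME_TOTAL'].quantile(0.25)
q3 = app_data['AMT_INCOME_TOTAL'].quantile(0.75)
iqr = q3 - q1
outliers = app_data[(app_data['AMT_INCOME_TOTAL'] < q1 - 1.5 * iqr) |
                    (app_data['AMT_INCOME_TOTAL'] > q3 + 1.5 * iqr)]
print(len(outliers), 'potential outliers in AMT_INCOME_TOTAL')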
Data visualization is a crucial part of EDA as it allows you to identify patterns and tendencies in your data. Python has many visualization libraries, including Matplotlib, Seaborn, and Plotly. These libraries offer an extensive set of charts and graphs that you can use to help present your data.
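As a quick illustration (the choice of AMT_CREDIT and the log scale are assumptions, used here to handle the skew typical of monetary columns):

# Distribution of the credit amount, log-scaled to handle skew
sns.histplot(app_data['AMT_CREDIT'], bins=50, log_scale=True)
plt.title('Distribution of AMT_CREDIT')
plt.show()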
Data preprocessing is the process of cleaning and transforming your data before you start your analysis. It's a crucial step in EDA because it can greatly impact the results of your analysis. Python provides several libraries for data preprocessing, including Pandas and Scikit-learn.
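A minimal sketch with Scikit-learn (the column selection and the median/standardisation choices are illustrative assumptions):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Impute missing numeric values with the median, then standardise
X = app_data[['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY']]
X_imputed = SimpleImputer(strategy='median').fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)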
Hypothesis testing is a statistical method for determining whether a hypothesis about a population is true. This is a useful step in EDA as it lets you draw logical conclusions from your data. SciPy and Statsmodels are two Python packages for testing hypotheses.
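For instance, a sketch of a chi-square test of independence between gender and the target (the choice of test and columns is an illustrative assumption):

from scipy import stats

# Is the default rate independent of gender?
contingency = pd.crosstab(app_data['CODE_GENDER'], app_data['TARGET'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print('chi2 =', round(chi2, 2), 'p-value =', round(p_value, 4))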
The kind of analysis seen above is used heavily in risk analytics in banking and financial services. This way, historical data can be used to minimise the risk of losing money while lending to customers. The scope of EDA in other sectors is endless, and it should be used extensively.
If you are curious to learn about data science, check out IIIT-B & upGrad's Executive PG in Data Science, which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 sessions with industry mentors, 400+ hours of learning and job assistance with top firms.