Explore Courses
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Birla Institute of Management Technology Birla Institute of Management Technology Post Graduate Diploma in Management (BIMTECH)
  • 24 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Popular
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science & AI (Executive)
  • 12 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
University of MarylandIIIT BangalorePost Graduate Certificate in Data Science & AI (Executive)
  • 8-8.5 Months
upGradupGradData Science Bootcamp with AI
  • 6 months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
OP Jindal Global UniversityOP Jindal Global UniversityMaster of Design in User Experience Design
  • 12 Months
Popular
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Rushford, GenevaRushford Business SchoolDBA Doctorate in Technology (Computer Science)
  • 36 Months
IIIT BangaloreIIIT BangaloreCloud Computing and DevOps Program (Executive)
  • 8 Months
New
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Popular
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
Golden Gate University Golden Gate University Doctor of Business Administration in Digital Leadership
  • 36 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
Popular
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
Bestseller
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
IIIT BangaloreIIIT BangalorePost Graduate Certificate in Machine Learning & Deep Learning (Executive)
  • 8 Months
Bestseller
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in AI and Emerging Technologies (Blended Learning Program)
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
ESGCI, ParisESGCI, ParisDoctorate of Business Administration (DBA) from ESGCI, Paris
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration From Golden Gate University, San Francisco
  • 36 Months
Rushford Business SchoolRushford Business SchoolDoctor of Business Administration from Rushford Business School, Switzerland)
  • 36 Months
Edgewood CollegeEdgewood CollegeDoctorate of Business Administration from Edgewood College
  • 24 Months
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with Concentration in Generative AI
  • 36 Months
Golden Gate University Golden Gate University DBA in Digital Leadership from Golden Gate University, San Francisco
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Deakin Business School and Institute of Management Technology, GhaziabadDeakin Business School and IMT, GhaziabadMBA (Master of Business Administration)
  • 12 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science (Executive)
  • 12 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityO.P.Jindal Global University
  • 12 Months
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (AI/ML)
  • 36 Months
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDBA Specialisation in AI & ML
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
New
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGrad KnowledgeHutupGrad KnowledgeHutAzure Administrator Certification (AZ-104)
  • 24 Hours
KnowledgeHut upGradKnowledgeHut upGradAWS Cloud Practioner Essentials Certification
  • 1 Week
KnowledgeHut upGradKnowledgeHut upGradAzure Data Engineering Training (DP-203)
  • 1 Week
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
Loyola Institute of Business Administration (LIBA)Loyola Institute of Business Administration (LIBA)Executive PG Programme in Human Resource Management
  • 11 Months
Popular
Goa Institute of ManagementGoa Institute of ManagementExecutive PG Program in Healthcare Management
  • 11 Months
IMT GhaziabadIMT GhaziabadAdvanced General Management Program
  • 11 Months
Golden Gate UniversityGolden Gate UniversityProfessional Certificate in Global Business Management
  • 6-8 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
IU, GermanyIU, GermanyMaster of Business Administration (90 ECTS)
  • 18 Months
Bestseller
IU, GermanyIU, GermanyMaster in International Management (120 ECTS)
  • 24 Months
Popular
IU, GermanyIU, GermanyB.Sc. Computer Science (180 ECTS)
  • 36 Months
Clark UniversityClark UniversityMaster of Business Administration
  • 23 Months
New
Golden Gate UniversityGolden Gate UniversityMaster of Business Administration
  • 20 Months
Clark University, USClark University, USMS in Project Management
  • 20 Months
New
Edgewood CollegeEdgewood CollegeMaster of Business Administration
  • 23 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
KnowledgeHut upGradKnowledgeHut upGradBackend Development Bootcamp
  • Self-Paced
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 5 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
upGradupGradUI/UX Bootcamp
  • 3 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
upGradupGradDigital Marketing Accelerator Program
  • 05 Months

Exploratory Data Analysis in Python: What You Need to Know?

Updated on 17 January, 2024

6.49K+ views
9 min read

Exploratory Data Analysis (EDA) is a very common and important practice followed by all data scientists. It is the process of looking at tables and tables of data from different angles in order to understand it fully. Gaining a good understanding of data helps us to clean and summarize it, which then brings out the insights and trends which were otherwise unclear.

EDA has no hard-core set of rules which are to be followed like in ‘data analysis’, for example. People who are new to the field always tend to confuse between the two terms, which are mostly similar but different in their purpose. Unlike EDA, data analysis is more inclined towards the implementation of probabilities and statistical methods to reveal facts and relationships among different variants.

Coming back, there is no right or wrong way to perform EDA. It varies from person to person however, there are some major guidelines commonly followed which are listed below.

  • Handling missing values: Null values can be seen when all the data may not have been available or recorded during collection.
  • Removing duplicate data: It is important to prevent any overfitting or bias created during training the machine learning algorithm using repeated data records
  • Handling outliers: Outliers are records that drastically differ from the rest of the data and don’t follow the trend. It can arise due to certain exceptions or inaccuracy during data collection
  • Scaling and normalizing: This is only done for numerical data variables. Most of the time the variables greatly differ in their range and scale which makes it difficult to compare them and find correlations.
  • Univariate and Bivariate analysis: Univariate analysis is usually done by seeing how one variable is affecting the target variable. Bivariate analysis is carried out between any 2 variables, it can either be numerical or categorical or both.

We will look at how some of these are implemented using a very famous ‘Home Credit Default Risk’ dataset available on Kaggle here. The data contains information about the loan applicant at the time of applying for the loan. It contains two types of scenarios:

  • The client with payment difficulties: he/she had late payment more than X days

on at least one of the first Y instalments of the loan in our sample,

  • All other cases: All other cases when the payment is paid on time.

We’ll be only working on the application data files for the sake of this article.

Related: Python Project Ideas & Topics for Beginners

Looking at the Data

app_data = pd.read_csv( ‘application_data.csv’ )

app_data.info()

After reading the application data, we use the info() function to get a short overview of the data we’ll be dealing with. The output below informs us that we have around 300000 loan records with 122 variables. Out of these, there are 16 categorical variables and the rest numerical.

<class ‘pandas.core.frame.DataFrame’>

RangeIndex: 307511 entries, 0 to 307510

Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR

dtypes: float64(65), int64(41), object(16)

memory usage: 286.2+ MB

It is always a good practice to handle and analyse numerical and categorical data separately.

categorical = app_data.select_dtypes(include = object).columns

app_data[categorical].apply(pd.Series.nunique, axis = 0)

Looking only at the categorical features below, we see that most of them only have a few categories which make them easier to analyse using simple plots.

NAME_CONTRACT_TYPE             2

CODE_GENDER                    3

FLAG_OWN_CAR                   2

FLAG_OWN_REALTY                2

NAME_TYPE_SUITE                7

NAME_INCOME_TYPE               8

NAME_EDUCATION_TYPE            5

NAME_FAMILY_STATUS             6

NAME_HOUSING_TYPE              6

OCCUPATION_TYPE               18

WEEKDAY_APPR_PROCESS_START     7

ORGANIZATION_TYPE             58

FONDKAPREMONT_MODE             4

HOUSETYPE_MODE                 3

WALLSMATERIAL_MODE             7

EMERGENCYSTATE_MODE            2

dtype: int64

Now for the numerical features, the describe() method gives us the statistics of our data:

numer= app_data.describe()

numerical= numer.columns

numer

Looking at the entire table it’s evident that:

  • days_birth is negative: applicant’s age (in days) relative to the day of application
  • days_employed has outliers (max value is around 100 years) (635243)
  • amt_annuity- mean much smaller than the max value

So now we know which features will have to be analysed further.

Our learners also read: Free Python Course with Certification

upGrad’s Exclusive Data Science Webinar for you –

Transformation & Opportunities in Analytics & Insights

Missing Data

We can make a point plot of all the features having missing values by plotting the % of missing data along Y-axis.

missing = pd.DataFrame( (app_data.isnull().sum()) * 100 / app_data.shape[0]).reset_index()

plt.figure(figsize = (16,5))

ax = sns.pointplot(‘index’, 0, data = missing)

plt.xticks(rotation = 90, fontsize = 7)

plt.title(“Percentage of Missing values”)

plt.ylabel(“PERCENTAGE”)

plt.show()

Many columns have a lot of missing data (30-70%), some have few missing data (13-19%) and many columns also have no missing data at all. It is not really necessary to modify the dataset when you just have to perform EDA. However, going ahead with data pre-processing, we should know how to handle missing values.

For features with less missing values, we can use regression to predict the missing values or fill with the mean of the values present, depending on the feature. And for features with a very high number of missing values, it is better to drop those columns as they give very less insight on analysis. 

Data Imbalance

In this dataset, loan defaulters are identified using the binary variable ‘TARGET’.

100 * app_data[‘TARGET’].value_counts() / len(app_data[‘TARGET’])

0    91.927118

1     8.072882

Name: TARGET, dtype: float64

We see that the data is highly imbalanced with a ratio of 92:8. Most of the loans were paid back on time (target = 0). So whenever there is such a huge imbalance, it is better to take features and compare them with the target variable (targeted analysis) to determine what categories in those features tend to default on the loans more than others.

Below are just a few examples of graphs that can be made using the seaborn library of python and simple user-defined functions.

Also, Check out all trending Python tutorial concepts in 2024.

Gender

Males (M) have a higher chance of defaulting compared to females (F), even though the number of female applicants is almost twice as more. So females are more reliable than men for paying back their loans.

Education Type

Even though most student loans are for their secondary education or higher education, it is the lower secondary education loans that are riskiest for the company followed by secondary.

Also Read: Career in Data Science

Key Techniques Used in Exploratory Data Analysis in Python

Several techniques are essential in exploratory data analysis Python as they help understand and clean the data in addition to identifying relevant features and testing hypotheses about the data. Python libraries bring along various functions and methods for implementing these techniques and thus make EDA Python a powerful tool for data analysis.

Feature Engineering

The process of building new features from existing ones is known as feature engineering. It is a critical stage in EDA Python since it enables you to extract additional information from your data. Python includes various libraries for feature engineering, such as NumPy, Pandas, and Scikit-learn.

Outlier Detection

Outliers are data points significantly different from the rest of the data in your dataset. Outsiders can substantially impact your research, so they must be properly recognised and managed. Outlier identification methods in Python include the Z-score, IQR, and isolation forest.

Data Visualization

It is a crucial part of EDA as it allows you to identify patterns and tendencies in your statistics. Python has many visualization libraries, which include Matplotlib, Seaborn, and Plotly. These libraries have an intensive set of charts and graphs that you could use to help show your data.

Data Preprocessing

Data preprocessing is cleaning and transforming your data before you start your analysis. It’s a crucial step in EDA because it can greatly impact the results of your analysis. Python provides several libraries for data preprocessing, including Pandas and Scikit-learn.

Hypothesis Testing

Hypothesis trying out is a statistical method of determining whether or not population speculation is true. This is a critical step in EDA as it lets you attract logical conclusions from your data. Scipy and Statsmodels are two Python packages for testing hypotheses.

Conclusion

Such kind of an analysis seen above is done vastly in risk analytics in banking and financial services. This way data archives can be used to minimise the risk of losing money while lending to customers. The scope of EDA in all other sectors is endless and it should be used extensively.

If you are curious to learn about data science, check out IIIT-B & upGrad’s Executive PG in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.

Frequently Asked Questions (FAQs)

1. Why is Exploratory Data Analysis (EDA) needed?

Exploratory Data Analysis is considered to be the initial level when you start modelling your data. This is quite an insightful technique to analyze the best practices for modelling your data. You will be able to extract visual plots, graphs, and reports from the data to get a complete understanding of it.
The EDA involves certain steps to completely analyse the data including deriving the statistical results, finding missing data values, handling the faulty data entries, and finally deducing various plots and graphs.
The primary aim of this analysis is to ensure that the data set you are using is suitable to start applying modelling algorithms. That’s the reason that this is the first step you should be performing on your data before moving to the modelling stage.

2. What are outliers and how to handle them?

Outliers are referred to the anomalies or slight variances in your data. It can happen during the data collection. There are 4 ways in which we can detect an outlier in the data set. These methods are as follows:
1. Boxplot - Boxplot is a method of detecting an outlier where we segregate the data through their quartiles.
2. Scatterplot - A scatter plot displays the data of 2 variables in the form of a collection of points marked on the cartesian plane. The value of one variable represents the horizontal axis (x-ais) and the value of the other variable represents the vertical axis (y-axis).
3. Z-score - While calculating the Z-score, we look for the points that are far away from the centre and consider them as outliers.
4. InterQuartile Range (IQR) - The InterQuartile Range or IQR is the difference between the upper and lower quartiles or 75th and 25th quartile, often referred to as the statistical dispersion.

3. What are the guidelines to perform EDA?

Unlike data analysis, there are no hard and fast rules and regulations to be followed for EDA. One cannot say that this is the right method or that is the wrong method to perform EDA. Beginners are often misunderstood and get confused between EDA and data analysis.
However, there are some guidelines that are commonly practised:
1. Handling missing values
2. Removing duplicate data
3. Handling outliers
4. Scaling and normalizing
5. Univariate and Bivariate analysis