What They Don't Tell You About Exploratory Data Analysis in Python!
By Rohit Sharma
Updated on Jul 31, 2025 | 15 min read | 7.14K+ views
Did you know? Exploratory Data Analysis (EDA) was introduced by John Tukey in the 1970s. He encouraged analysts to explore data visually and statistically before building models. This approach helps you uncover patterns and understand data behavior early. It also makes it easier to spot outliers, trends, and unusual values.
Exploratory Data Analysis (EDA) helps you understand your data before building models. It reveals patterns, outliers, and trends. Python is widely used for EDA because libraries like Pandas and Seaborn make it easy to clean and visualize data.
For example, if you're analyzing customer purchase data, Python can quickly show spending patterns, seasonal spikes, or missing entries. It lets you fix issues early and build smarter models later. It's efficient, flexible, and beginner-friendly.
In this blog, you’ll learn how to explore, visualize, and interpret datasets using Python’s EDA tools. This includes identifying trends, spotting outliers, and preparing data.
Exploratory Data Analysis (EDA) is the bridge between raw data and meaningful insight. In Python, it's not just about plotting graphs. It's about asking the right questions, choosing the right tools, and carefully investigating each feature before modeling. Python’s libraries like Pandas, Matplotlib, and Seaborn offer a rich, flexible environment to explore data interactively and efficiently.
When performing EDA, always keep these in mind:
- Know the business objective before you start exploring.
- Clean dates, text, and missing values before you visualize anything.
- Pair summary statistics with visuals so you see both the numbers and the shape of the data.
- Look at how features interact, not just at individual columns.
- Treat patterns as hypotheses to investigate, not as conclusions.
Let’s walk through a real-world EDA scenario using Python. Imagine you work at an e-commerce company. You’ve been handed a dataset with the following columns: Order_ID, Customer_ID, Order_Date, City, Product_Category, Quantity, Price, Payment_Method, Delivery_Status, and Customer_Rating.
Your goal is to explore this data to identify trends, issues, and patterns in sales and delivery.
Before diving into data exploration, you need to import the necessary libraries and load your CSV file into a DataFrame using Pandas. This gives you a structured view of your raw data.
CSV:
Order_ID,Customer_ID,Order_Date,City,Product_Category,Quantity,Price,Payment_Method,Delivery_Status,Customer_Rating
O1001,C001,2023-01-10,Delhi,Electronics,1,55000,Credit Card,Delivered,4.5
O1002,C002,2023-01-12,Mumbai,Fashion,2,2500,Credit Card,Delivered,4.0
O1003,C003,2023-01-13,Bangalore,Electronics,1,30000,NetBanking,Pending,3.5
O1004,C004,2023-01-14,Chennai,Fashion,3,1200,COD,Delivered,4.8
O1005,C005,2023-01-14,Mumbai,Home Decor,2,4000,Credit Card,Delivered,4.1
O1006,C006,2023-01-15,Delhi,Electronics,1,60000,NetBanking,Cancelled,
O1007,C007,2023-01-15,Bangalore,Fashion,2,1800,COD,Delivered,3.8
O1008,C008,2023-01-16,Chennai,Electronics,1,35000,UPI,Delivered,4.6
O1009,C009,2023-01-17,Mumbai,Home Decor,2,,Credit Card,Delivered,4.2
O1010,C010,2023-01-18,Delhi,Fashion,1,1500,UPI,Pending,3.7
O1011,C011,2023-01-19,Kolkata,Fashion,2,2000,COD,Delivered,4.3
O1012,C012,2023-01-20,Bangalore,Electronics,1,28000,NetBanking,Delivered,4.9
O1013,C013,2023-01-20,Mumbai,Fashion,1,2200,Credit Card,Returned,2.5
O1014,C014,2023-01-21,Chennai,Home Decor,1,3000,UPI,Delivered,4.0
O1015,C015,2023-01-22,Delhi,Electronics,1,,UPI,Delivered,4.4
Python Code:
import pandas as pd
# Load the dataset
orders = pd.read_csv("ecommerce_orders.csv", parse_dates=["Order_Date"])
# Display the first 5 rows
print(orders.head())
What this does? read_csv() loads the CSV into a Pandas DataFrame, parse_dates=["Order_Date"] converts the order dates to datetime while loading, and head() prints the first five rows so you can sanity-check the columns and values.
Output:
Order_ID Customer_ID Order_Date City Product_Category Quantity Price Payment_Method Delivery_Status Customer_Rating
0 O1001 C001 2023-01-10 Delhi Electronics 1 55000.0 Credit Card Delivered 4.5
1 O1002 C002 2023-01-12 Mumbai Fashion 2 2500.0 Credit Card Delivered 4.0
2 O1003 C003 2023-01-13 Bangalore Electronics 1 30000.0 NetBanking Pending 3.5
3 O1004 C004 2023-01-14 Chennai Fashion 3 1200.0 COD Delivered 4.8
4 O1005 C005 2023-01-14 Mumbai Home Decor 2 4000.0 Credit Card Delivered 4.1
Quick Checks:
print(orders.shape) # (15, 10)
print(orders.info()) # To verify datatypes and nulls
print(orders.columns) # Check if column names are correct
Also Read: Data Exploration Basics: A Beginner's Step-by-Step Guide
Clean date formats early to avoid downstream issues like plotting errors or incorrect filtering. Even if you've parsed the dates while loading, it's smart to confirm consistency.
Why this matters? Some records may have mixed formats (like 2023/01/10, 10-01-2023, or 2023-01-10). Even if read_csv() parses them, bad entries may still slip through. Inconsistent dates can mess up sorting, groupby analysis, and time-based filtering.
You already used parse_dates=["Order_Date"] while loading the CSV. Now, double-check and force all dates into the standard format: YYYY-MM-DD.
Python Code:
# Ensure all dates are in proper datetime format
orders["Order_Date"] = pd.to_datetime(orders["Order_Date"], errors="coerce")
# Confirm the format by printing the first few rows
print(orders["Order_Date"].head())
Explanation: pd.to_datetime() with errors="coerce" converts every value to a proper datetime and turns anything it cannot parse into NaT, so bad entries surface immediately instead of silently breaking sorting, grouping, or filtering.
Sample Output:
0 2023-01-10
1 2023-01-12
2 2023-01-13
3 2023-01-14
4 2023-01-14
Name: Order_Date, dtype: datetime64[ns]
Quick Check for Bad Dates:
Now let’s check if any rows were coerced into NaT:
# Identify invalid date rows
invalid_dates = orders[orders["Order_Date"].isna()]
print(invalid_dates)
If you see any rows here, they need manual correction or removal.
Also Read: Top 20+ Best Business Analysis Techniques to Master in 2025
Text columns often contain inconsistent casing, extra spaces, and unexpected typos. Before analysis, you need to clean them. Clean text ensures your groupby operations, filters, and visualizations are accurate and don’t miss values due to subtle mismatches.
Why This Matters? Let’s say the City column contains both "Delhi" and "delhi "—Python will treat them as different entries. This affects aggregation, filtering, and visual analysis. The same applies to columns like Product_Category, Payment_Method, and Delivery_Status.
Columns to Clean in This Dataset:
We’ll remove leading/trailing spaces and convert everything to lowercase for consistency.
Python Code:
# Clean text-based columns
orders["City"] = orders["City"].str.strip().str.lower()
orders["Product_Category"] = orders["Product_Category"].str.strip().str.lower()
orders["Payment_Method"] = orders["Payment_Method"].str.strip().str.lower()
orders["Delivery_Status"] = orders["Delivery_Status"].str.strip().str.lower()
# Handle empty or unknown delivery statuses
orders["Delivery_Status"] = orders["Delivery_Status"].replace("", "unknown")
orders["Delivery_Status"].fillna("unknown", inplace=True)
Explanation: str.strip() removes stray leading and trailing spaces, str.lower() makes the casing uniform across all four text columns, and the replace()/fillna() pair maps empty strings and NaN in Delivery_Status to a single "unknown" label.
Before Cleaning Example:
City  | Payment_Method
Delhi | Cash
delhi | cash
Delhi | Cash
delhi | CASH
After Cleaning:
print(orders[["City", "Payment_Method"]].drop_duplicates())
City Payment_Method
0 delhi cash
1 mumbai upi
2 chennai credit card
3 bangalore cod
Now, your categorical columns are consistent. This will make analysis smooth and accurate in later steps like grouping or plotting.
If you’re wondering how to extract insights from datasets, the free Excel for Data Analysis Course is a perfect starting point. The certification is an add-on that will enhance your portfolio.
Also Read: Top Data Analytics Tools Every Data Scientist Should Know About
Real-world data is rarely perfect. You’ll often find missing, incorrect, or illogical values that can break your analysis or skew results. In this step, you’ll learn how to identify and fix such issues in your dataset.
What to Look For? Missing prices (orders O1009 and O1015 have no Price), a missing customer rating (O1006), blank or missing Payment_Method and Delivery_Status values, negative quantities, and ratings outside the valid 1-5 range.
Strategy: Convert Price to numeric, fill missing prices with the average price of the same product category, take the absolute value of any negative quantity, label blank payment and delivery fields as "unknown", and clip ratings to the 1-5 range.
Python Code:
# Convert Price to numeric
orders["Price"] = pd.to_numeric(orders["Price"], errors="coerce")
# Fill missing prices using average price per product
orders["Price"] = orders.groupby("Product_Category")["Price"].transform(
lambda x: x.fillna(x.mean())
)
# Fix negative quantities
orders["Quantity"] = orders["Quantity"].abs()
# Clean empty or missing delivery/payment fields
orders["Payment_Method"] = orders["Payment_Method"].replace("", "unknown")
orders["Payment_Method"] = orders["Payment_Method"].fillna("unknown")
orders["Delivery_Status"] = orders["Delivery_Status"].replace("", "unknown")
orders["Delivery_Status"] = orders["Delivery_Status"].fillna("unknown")
# Cap Customer Ratings to 1-5 range
orders["Customer_Rating"] = orders["Customer_Rating"].clip(lower=1, upper=5)
Explanation: pd.to_numeric() with errors="coerce" turns non-numeric prices into NaN, groupby().transform() fills those NaNs with the mean price of the same product category, .abs() flips negative quantities to positive, blank payment and delivery fields become "unknown", and clip(lower=1, upper=5) caps ratings to the valid range.
Before Cleaning:
Order_ID | Product_Category | Quantity | Price | Payment_Method | Customer_Rating
O1001    | laptop           | -2       | 55000 | Cash           | 4.2
O1002    | mobile           | 1        |       |                | 6.0
O1003    | tablet           | 1        | 30000 | UPI            | 0
After Cleaning:
Order_ID | Product_Category | Quantity | Price | Payment_Method | Customer_Rating
O1001    | laptop           | 2        | 55000 | cash           | 4.2
O1002    | mobile           | 1        | 30000 | unknown        | 5.0
O1003    | tablet           | 1        | 30000 | upi            | 1.0
Handling missing and invalid values now will save you major debugging time later. Next, we’ll engineer some new features to add more value to your analysis.
Also Read: Understanding Data Science vs Data Analytics: Key Insights
Once your data is clean, it’s time to add new columns that can give deeper insights. These derived features help you segment, analyze, and visualize the data more meaningfully. In this step, we’ll create two valuable columns: Total_Value and Order_Weekday.
Why These Columns? Total_Value (Quantity × Price) tells you how much revenue each order generated, and Order_Weekday tells you which day of the week the order was placed, so you can compare weekday and weekend behavior.
Adding such fields turns raw data into actionable metrics, especially for trend analysis and dashboards.
Python Code:
# Create total order value
orders["Total_Value"] = orders["Quantity"] * orders["Price"]
# Convert Order_Date to datetime (if not already)
orders["Order_Date"] = pd.to_datetime(orders["Order_Date"], errors="coerce")
# Create new column for day of the week
orders["Order_Weekday"] = orders["Order_Date"].dt.day_name()
Explanation: Total_Value multiplies Quantity by Price for each row, pd.to_datetime() ensures Order_Date is a true datetime (a no-op if it was already parsed at load time), and .dt.day_name() extracts the weekday name into Order_Weekday.
Sample Output:
Order_ID | Quantity | Price | Total_Value | Order_Date | Order_Weekday
O1001    | 2        | 55000 | 110000      | 2023-01-10 | Tuesday
O1002    | 1        | 30000 | 30000       | 2023-01-12 | Thursday
O1003    | 1        | 30000 | 30000       | 2023-01-10 | Tuesday
Why It Matters? With Total_Value, you can quickly group and compare by customer, product, or payment method. With Order_Weekday, you can analyze trends like “Do weekend orders differ from weekdays?” These columns set the stage for smarter aggregation, visualization, and forecasting.
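As a quick illustration, here is a minimal sketch (assuming the cleaned orders DataFrame built in the earlier steps, with the new Total_Value and Order_Weekday columns) of the kind of grouping these features enable:
# Revenue by payment method (assumes the cleaned `orders` DataFrame from above)
print(orders.groupby("Payment_Method")["Total_Value"].sum().sort_values(ascending=False))
# Average order value by weekday
print(orders.groupby("Order_Weekday")["Total_Value"].mean().round(2))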
Also Read: 33+ Data Analytics Project Ideas to Try in 2025 For Beginners and Professionals
Outliers are data points that deviate significantly from others in the dataset. In an e-commerce context, these could be unusually large orders, suspiciously low prices, or oddly high quantities. Identifying them early helps in fraud detection, marketing analysis, and improving data quality.
Why Detect Outliers? A single extreme order can drag averages and distort trend lines, some outliers point to fraud or data-entry errors, and others are genuine high-value (VIP) purchases worth analyzing separately.
Let’s use the Interquartile Range (IQR) method to flag high-value orders.
Python Code:
# Step 1: Calculate Q1 and Q3
Q1 = orders["Total_Value"].quantile(0.25)
Q3 = orders["Total_Value"].quantile(0.75)
IQR = Q3 - Q1
# Step 2: Define outlier threshold
upper_bound = Q3 + 1.5 * IQR
# Step 3: Create a new column for outlier flag
orders["Outlier"] = orders["Total_Value"].apply(lambda x: "Yes" if x > upper_bound else "No")
Explanation: Q1 and Q3 are the 25th and 75th percentiles of Total_Value, IQR is the gap between them, and any order worth more than Q3 + 1.5 × IQR is flagged as "Yes" in the new Outlier column.
Sample Output:
Order_ID | Total_Value | Outlier
O1001    | 110000      | Yes
O1002    | 30000       | No
O1003    | 30000       | No
Why It Matters? Now you can segment these high-value orders for further review.
For example, flagged orders can be routed for a manual fraud review, or the customers behind them can be targeted with loyalty offers and premium support.
These insights can drive both business strategy and model accuracy.
Also Read: [Infographic] Top 7 Skills Required for Data Analyst
After cleaning and enriching your dataset, it’s time to visualize it. Visual exploration lets you quickly spot trends, outliers, clusters, or seasonality that might be missed in raw numbers. Python’s Matplotlib and Seaborn libraries make it easy and powerful to create interactive, insightful plots.
Why Visuals Matter in EDA? Charts reveal distributions, clusters, seasonality, and outliers at a glance, and they surface relationships that are easy to miss when scanning raw tables or summary statistics.
1. Sales by Product Category
import seaborn as sns
import matplotlib.pyplot as plt
# Group by Product_Category and sum Total_Value
category_sales = orders.groupby("Product_Category")["Total_Value"].sum().reset_index()
# Barplot
plt.figure(figsize=(8, 5))
sns.barplot(data=category_sales, x="Product_Category", y="Total_Value", palette="muted")
plt.title("Total Sales by Product Category")
plt.ylabel("Sales (in ₹)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Output: a bar chart of total sales (in ₹) for each product category.
2. Order Volume by Weekday
weekday_counts = orders["Order_Weekday"].value_counts().reset_index()
weekday_counts.columns = ["Weekday", "Order_Count"]
# Sort by calendar order
ordered_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday_counts["Weekday"] = pd.Categorical(weekday_counts["Weekday"], categories=ordered_days, ordered=True)
weekday_counts.sort_values("Weekday", inplace=True)
# Line plot
plt.figure(figsize=(8, 5))
sns.lineplot(data=weekday_counts, x="Weekday", y="Order_Count", marker="o")
plt.title("Order Volume by Weekday")
plt.ylabel("Number of Orders")
plt.grid(True)
plt.tight_layout()
plt.show()
Output: a line chart of order counts by weekday, from Monday through Sunday.
3. Price Distribution by City
plt.figure(figsize=(10, 6))
sns.boxplot(data=orders, x="City", y="Price", palette="pastel")
plt.title("Price Distribution by City")
plt.ylabel("Price (₹)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Output: box plots comparing the spread of prices across cities.
4. Satisfaction by Payment Method
plt.figure(figsize=(8, 5))
sns.violinplot(data=orders, x="Payment_Method", y="Customer_Rating", inner="quartile", palette="Set3")
plt.title("Customer Rating by Payment Method")
plt.ylabel("Rating (Out of 5)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Output: violin plots showing the distribution of customer ratings for each payment method.
What You Gain from Visual Exploration? You’re no longer guessing what your data looks like. You’re seeing it clearly.
These plots bring shape to your data story and help you decide which product categories and cities deserve deeper analysis, which weekdays drive order volume, and where pricing or satisfaction issues may be hiding.
Also Read: Simple Steps to Becoming a Data Analyst With No Experience
Once your dataset is clean, enriched, and explored, it's time to save it for further use. Whether you're building dashboards in Power BI or training models in Python, saving the processed dataset correctly ensures seamless downstream work.
Why Saving Matters?
Save as CSV: A CSV file is versatile. You can load it into Excel, BI tools, or ML workflows.
orders.to_csv("cleaned_orders.csv", index=False)
This stores the final dataset with all transformations—standardized dates, cleaned text, derived columns, and outlier labels. It’s readable and compatible across platforms.
Save as Pickle (for Python use): If you're continuing with machine learning in Python, a pickle file preserves data types and structures.
orders.to_pickle("cleaned_orders.pkl")
You can later load it instantly:
orders = pd.read_pickle("cleaned_orders.pkl")
This is faster and better suited for Pandas-native workflows.
Save for Dashboards (e.g., SQLite or Excel): Need to connect your data to Power BI, Tableau, or Looker Studio?
Export to Excel:
orders.to_excel("cleaned_orders.xlsx", index=False)
Save to SQLite (if working with SQL-based dashboards):
import sqlite3
conn = sqlite3.connect("orders_db.sqlite")
orders.to_sql("orders_cleaned", conn, if_exists="replace", index=False)
conn.close()
Tip: Name your files clearly and include the date or version number. This makes it easier to track changes and collaborate across teams.
You're now ready to plug this data into BI dashboards, forecasting models, or customer segmentation pipelines. You don’t need to worry about data inconsistencies or manual prep.
You can get a better understanding of Python integration with upGrad’s Learn Python Libraries: NumPy, Matplotlib & Pandas. Learn how to manipulate data using NumPy, visualize insights with Matplotlib, and analyze datasets with Pandas.
Also Read: Data Quality in Big Data Analytics: Why Is So Important
Next, let’s look at some of the best practices you need to keep in mind when performing exploratory data analysis in Python.
Exploratory Data Analysis (EDA) is where you shape raw data into real insight. But EDA can get messy fast. Datasets are often inconsistent, incomplete, or cluttered with noise. Without structure, you risk wasting time or misreading patterns.
That’s why best practices matter. They help you clean data early, ask the right questions, and focus on useful signals. Following them keeps your analysis reliable, faster, and ready for modeling.
Here are the best practices to keep in mind when performing Exploratory Data Analysis (EDA) in Python:
1. Know the Business Objective First
Understanding why you're doing EDA shapes how you explore your data.
Example: If your goal is to reduce delivery delays, focus on columns like Order_Date, Delivery_Status, and City.
Outcome: You’ll spend less time exploring irrelevant features and more time solving the real problem.
2. Always Clean Your Data Before Visualizing
Dirty data leads to misleading plots and faulty assumptions.
Example: If you skip cleaning Price, you may include missing or negative values in sales totals.
Outcome: Visualizations are accurate and meaningful, helping you uncover true trends.
3. Use Summary Statistics and Visuals Together
Numbers give clarity, but visuals reveal relationships and distributions.
Example: Use df.describe() alongside a boxplot to spot price outliers.
Outcome: You get a full picture of the data, including central tendency, spread, and anomalies.
4. Explore Feature Relationships, Not Just Individual Columns
Look at how columns interact, especially for prediction or segmentation.
Example: Plot Customer_Rating vs. Payment_Method to understand satisfaction patterns by transaction type.
Outcome: You identify patterns that drive your outcome variables.
5. Avoid Overfitting to the EDA Story
Patterns can mislead. What looks like insight might just be coincidence.
Example: Seeing high ratings for COD orders doesn’t mean COD causes satisfaction—it could be location-dependent.
Outcome: You stay cautious, using EDA to ask better questions, not jump to conclusions.
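As a rough sketch of practices 3 and 5 (assuming the orders DataFrame used throughout this article), you could pair summary statistics with a grouped check to see whether the COD pattern holds within each city:
# Summary statistics alongside a grouped breakdown (assumes `orders` from above)
print(orders["Customer_Rating"].describe())
# Does the COD effect hold within each city, or is it location-dependent?
print(orders.groupby(["City", "Payment_Method"])["Customer_Rating"].mean().round(2))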
Also Read: Top 5 Best Data Analytics Courses in India to Boost Your Career in 2025
Next, let’s look at how upGrad can help you learn exploratory data analysis in Python.
Exploratory Data Analysis (EDA) is more than charts and stats. It’s how you understand your data’s story. Python gives you tools like Pandas, Seaborn, and Matplotlib, but using them well takes guided practice. That’s exactly what upGrad offers.
You don’t just read about EDA. You’ll handle messy datasets, find hidden trends, and uncover insights that models alone can’t. Each project helps you ask smarter questions and spot issues early. You’ll learn when to clean, what to visualize, and how to communicate results clearly. With upGrad’s hands-on approach, you build confidence in EDA, one real-world problem at a time.
If you're unsure where to begin or which area to focus on, upGrad’s expert career counselors can guide you based on your goals. You can also visit a nearby upGrad offline center to explore course options, get hands-on experience, and speak directly with mentors!
Frequently Asked Questions (FAQs)
1. How do you handle skewed data during EDA?
Skewed data can distort your interpretation of central tendencies and variability. If most of your values cluster on one side, metrics like the mean become misleading. During EDA, use df.skew() and histograms to assess skewness. If a variable like "customer spend" shows right skew (many small values, few large), apply a log or square root transformation. This normalizes the distribution and makes pattern detection more reliable during visualization and modeling.
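A minimal sketch, assuming a right-skewed numeric column such as Total_Value in the orders DataFrame used above:
import numpy as np
import matplotlib.pyplot as plt

print(orders["Total_Value"].skew())        # a clearly positive value suggests right skew
orders["Total_Value"].hist(bins=20)        # visual confirmation of the long right tail
plt.show()

# log1p handles zeros safely; keep the transformed copy for analysis or modeling
orders["Total_Value_log"] = np.log1p(orders["Total_Value"])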
2. How can you detect data leakage during EDA?
Data leakage is when information from outside the training dataset influences your model, often unknowingly. During EDA, leakage might appear as an unusually high correlation between an input feature and your target. Check derived features like future timestamps, or "Order Delivered Time" when predicting "Order Time." If a feature seems too predictive, it might be leaking. Use df.corr(), scatter plots, and time-based segmentation to catch suspicious patterns before modeling.
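A minimal sketch of the correlation check, treating Customer_Rating as a stand-in target (an assumption for illustration):
# Correlation of every numeric column with the assumed target, Customer_Rating
numeric_cols = orders.select_dtypes("number")
correlations = numeric_cols.corr()["Customer_Rating"].sort_values(ascending=False)
print(correlations)   # a near-perfect correlation with a derived feature deserves scrutiny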
3. What can go wrong with categorical plots like countplot?
Categorical plots like sns.countplot() can mask important insights if the data is unbalanced. For example, if "Fashion" appears 10,000 times and "Books" appears 50 times, smaller categories become visually insignificant. Also, inconsistent labels like "mobile", "Mobile ", and "MOBILE" split the true count across buckets. Always clean your text data (e.g., using .str.lower().str.strip()) and normalize or scale counts when plotting to avoid wrong conclusions.
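A minimal sketch, assuming the orders DataFrame: normalize the labels first, then plot counts in a consistent order:
import seaborn as sns
import matplotlib.pyplot as plt

# Merge variants like "Mobile " and "MOBILE" before counting
orders["Product_Category"] = orders["Product_Category"].str.lower().str.strip()

sns.countplot(data=orders, x="Product_Category",
              order=orders["Product_Category"].value_counts().index)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()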
4. How do you prioritize features when a dataset has 100+ columns?
In large datasets with 100+ columns, not all features deserve equal time. Start with .isnull().sum() to see where data is missing. Use .describe() to get quick stats on numerical fields. Then, generate a correlation heatmap to see which features relate most to your target. Focus first on high-variance, business-critical, or domain-known features. You can automate ranking with feature importance tools later, but EDA should remain exploratory, not biased toward known outcomes.
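A minimal sketch of that triage, using this article's orders DataFrame as a stand-in for a much wider one:
import seaborn as sns
import matplotlib.pyplot as plt

# On a wide DataFrame you would run the same calls; `orders` stands in here
print(orders.isnull().sum().sort_values(ascending=False).head(10))   # columns with the most gaps
print(orders.describe().T[["mean", "std", "min", "max"]])            # quick numeric profile

# Correlation heatmap across numeric columns to shortlist candidates
sns.heatmap(orders.select_dtypes("number").corr(), cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()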
5. How do you explore time-based trends?
Time-based trends are key to understanding cycles, seasonality, or operational issues. Start by converting dates with pd.to_datetime(). Then extract time units (day, week, month) with .dt.day, .dt.month, or .dt.weekday. For example, plotting Total_Sales over Order_Weekday reveals customer behavior patterns. Use sns.lineplot() or groupby().sum() to view spikes, dips, and inconsistencies. Always consider external factors like holidays, weekends, or marketing campaigns that might influence trends.
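A minimal sketch, assuming the orders DataFrame with its Order_Date and Total_Value columns:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

orders["Order_Date"] = pd.to_datetime(orders["Order_Date"], errors="coerce")
orders["Order_Month"] = orders["Order_Date"].dt.month

# Total sales per month; spikes and dips become obvious on a line plot
monthly = orders.groupby("Order_Month")["Total_Value"].sum().reset_index()
sns.lineplot(data=monthly, x="Order_Month", y="Total_Value", marker="o")
plt.tight_layout()
plt.show()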
6. Should you investigate outliers before removing them?
Yes, at least initially. Outliers can indicate critical business events or data issues. For instance, a ₹10,00,000 purchase might look like a mistake, but could be a VIP customer. Boxplots and scatterplots help you locate these points visually. Use IQR or Z-score methods to flag them. Before removing, investigate: was it a refund, a system glitch, or genuine activity? Document any removals clearly, especially if those points could impact future predictions.
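A minimal sketch flagging the same orders with both methods before deciding what to drop (assuming the orders DataFrame with Total_Value):
# IQR rule
q1, q3 = orders["Total_Value"].quantile([0.25, 0.75])
iqr_outliers = orders[orders["Total_Value"] > q3 + 1.5 * (q3 - q1)]

# Z-score rule (roughly: more than 3 standard deviations from the mean)
z_scores = (orders["Total_Value"] - orders["Total_Value"].mean()) / orders["Total_Value"].std()
z_outliers = orders[z_scores.abs() > 3]

print(iqr_outliers[["Order_ID", "Total_Value"]])
print(z_outliers[["Order_ID", "Total_Value"]])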
7. Should you look for patterns in missing values?
Definitely. Missing values aren't just annoying. They tell a story. If all NetBanking payments lack customer ratings, you might have an integration failure. Or, if certain regions have null delivery times, it could be due to an unrecorded courier service. Use .isnull().sum() along with .groupby() to segment missingness. Check if it’s random or patterned. Missing values tied to categories are often systematic and need fixing at the pipeline level, not just imputation.
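A minimal sketch of segmenting missingness by category, assuming the orders DataFrame:
# Share of missing ratings per payment method; a value near 1.0 flags a systematic gap
missing_by_method = (orders["Customer_Rating"].isna()
                     .groupby(orders["Payment_Method"])
                     .mean()
                     .sort_values(ascending=False))
print(missing_by_method)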
8. How do you explore relationships between multiple variables?
Looking at variables in isolation only gives part of the story. For multi-variable exploration, use sns.pairplot() to observe bivariate scatterplots, but only for small datasets. For bigger ones, use correlation matrices, grouped summaries, or 3D plots (e.g., plotly). You can also use Seaborn’s hue parameter to add another dimension, like sns.scatterplot(x='Price', y='Rating', hue='City'). This reveals interactions that may only emerge in combination.
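A minimal sketch using the orders DataFrame (column names match this article's dataset):
import seaborn as sns
import matplotlib.pyplot as plt

# A third dimension via hue: does the price-rating relationship differ by city?
sns.scatterplot(data=orders, x="Price", y="Customer_Rating", hue="City")
plt.show()

# pairplot is fine for a small table like this one but gets slow on large datasets
sns.pairplot(orders[["Price", "Quantity", "Total_Value", "Customer_Rating"]].dropna())
plt.show()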
9. What hidden data quality issues should you watch for?
Not all issues scream out as errors. Some hide in plain sight. For instance, object type columns might contain numbers stored as strings, or inconsistent formats like “1,000” vs “1000”. You might have mixed units in the same column, like ₹ and $. Spelling variants like “delhi” and “Delhi” silently split categories. Use .unique() and .value_counts() to find these. Always check for hidden whitespace, null-like strings (e.g., “-”), or zero inflation.
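A minimal sketch that surfaces these silent issues in the orders DataFrame:
# Inspect raw category values to spot casing, whitespace, and null-like strings
for col in orders.select_dtypes("object").columns:
    print(col, orders[col].unique()[:10])

print(orders["Payment_Method"].value_counts(dropna=False))   # includes NaN counts
print(orders.dtypes)   # numbers stored as strings show up as object columns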
10. How do you avoid over-interpreting patterns during EDA?
It’s tempting to find meaning in every plot, but not every pattern is meaningful. If you see a relationship, test it across time slices or subgroups. Validate patterns with external knowledge or business logic. Avoid cherry-picking features that look good in one chart. EDA is about generating hypotheses, not confirming them. Keep notes of what you observe and question your assumptions at every step. Never treat visuals as proof.
11. How should you document and share EDA findings?
Summarize key insights in a clear and reproducible format. Use groupby() outputs, pivot tables, and concise charts. For example, create a heatmap of customer satisfaction across cities and product categories. Highlight top correlations, common missing columns, and outliers. Save plots and code in a Jupyter notebook or Markdown doc. This acts as your audit trail and helps data scientists, product teams, or stakeholders understand what’s going into your model and why.
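A minimal sketch of the suggested heatmap summary, assuming the orders DataFrame from this article:
import seaborn as sns
import matplotlib.pyplot as plt

# Average customer rating by city and product category, as a shareable heatmap
summary = orders.pivot_table(index="City", columns="Product_Category",
                             values="Customer_Rating", aggfunc="mean")
sns.heatmap(summary, annot=True, cmap="YlGnBu")
plt.title("Average Rating by City and Category")
plt.tight_layout()
plt.show()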