Introduction to Outliers in Data Mining: Types, Analysis, and Techniques

Updated on 23 November, 2022

Outliers come up in almost every discussion of data analysis. They may indicate errors, or they may point to something novel. Data mining involves analysing data and predicting the information it holds. Sometimes certain objects in a dataset deviate markedly from the rest; these deviating objects are called outliers. They are mostly produced by errors in measurement or execution. In 1969, Grubbs gave the first definition of an outlier.

Errors such as computational mistakes or incorrect data entry cause outliers. Outliers should be distinguished from noise:

  • Noise is random error or variance in a measured variable.
  • It is advisable to remove noise before detecting the outliers in a dataset.

Outlier Types

Broadly, outliers can be classified as univariate or multivariate.

  • Univariate outliers occur when a single-dimensional feature space is considered.
  • Multivariate outliers occur in an n-dimensional feature space. An n-dimensional distribution is difficult for a person to inspect directly, so a model is trained to learn the distribution pattern instead.

Outliers can also be classified by the way they deviate:

  1. Point outliers: single data points that lie far from the rest of the distribution.
  2. Contextual outliers: As the name suggests, these are outliers only within a particular context, such as background noise in a speech-recognition signal. They occur when a data instance is anomalous under a specific context or condition. Data objects have two kinds of attributes: contextual attributes, which define the context, and behavioural attributes, which define the object’s characteristics.
  3. Collective outliers: These occur when a group of data points behaves anomalously as a whole, even if the individual points do not. They can reveal novelty in the data.

Analysis of Outliers

Outliers are mostly discarded when data mining methods are applied, but they are still the focus of applications such as fraud detection. This is mainly because rarely occurring events can hold far more interesting information than events that occur regularly.

Applications where outlier detection plays a major role include:

  • Fraud detection in insurance, credit cards, and healthcare.
  • Fraud detection in telecom.
  • Intrusion detection in cybersecurity.
  • Medical analysis.
  • Fault detection in safety-critical systems.
  • Marketing, where outlier analysis helps identify customers’ spending patterns.
  • Analysis of unusual responses to medical treatments.

Outlier analysis is the process of identifying anomalous behaviour in a dataset. Also known as “outlier mining”, it is an important data mining task.

Outlier Detection Techniques

Various techniques and approaches are used to detect anomalous behaviour in a dataset. A few techniques used for outlier detection are:

1. Sorting

  • It is one of the easiest ways to detect outliers in data mining.
  • The data are sorted by magnitude in any data manipulation tool.
  • Inspecting the sorted data then reveals any objects whose values are much higher or much lower than the rest.
  • These objects can be treated as outliers.
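
The sorting approach can be sketched in a few lines of Python. This is only an illustration: the helper name extreme_candidates, the parameter k, and the sample readings are all assumptions made for the example, and deciding whether an extreme value really is an outlier is still left to the observer.

```python
def extreme_candidates(values, k=2):
    """Sort the values and return the k smallest and k largest as candidates."""
    ordered = sorted(values)
    return ordered[:k], ordered[-k:]


# Hypothetical sensor readings with two suspicious values.
readings = [12.1, 11.8, 12.4, 95.0, 12.0, 11.9, -40.0, 12.3]

lowest, highest = extreme_candidates(readings, k=1)
print(lowest, highest)  # [-40.0] [95.0]
```

Sorting only surfaces the extremes; it does not say how extreme is too extreme, which is what the later techniques try to quantify.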

2. Data graphing for detecting outliers

  • The technique plots all the data points on a graph, which lets the observer see which points diverge from the other objects in the dataset.
  • Outliers become easier to spot this way.
  • Plots that can be used for detecting outliers in data mining include the histogram, scatter plot, and box plot.
  • In a histogram, most data points pile up on one side, and the few points far away on the other side represent the outliers.
  • A scatter plot shows the degree of association between two numerical variables. An observation that lies far from that association represents an outlier.
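
A box plot draws individual points beyond its whiskers, and the same rule can be applied numerically. The sketch below assumes the common convention of placing the whiskers 1.5 interquartile ranges beyond the quartiles; the article does not state a specific rule, and the helper name boxplot_outliers and the sample data are hypothetical.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return the values that fall outside the box-plot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1                                   # interquartile range
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


readings = [12.1, 11.8, 12.4, 95.0, 12.0, 11.9, -40.0, 12.3]
flagged = boxplot_outliers(readings)
print(sorted(flagged))  # [-40.0, 95.0]
```

Different tools compute quartiles in slightly different ways, so the exact fences may vary a little between libraries.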

3. Z-Score for detecting outliers

  • The Z-score measures how far a data point deviates from the sample mean, expressed in standard deviations. A Gaussian distribution is assumed in this case.
  • When a Gaussian distribution does not describe the data, transformations such as scaling are applied first.
  • Python libraries such as Scikit-Learn and SciPy have built-in functions that make these transformations easy to implement; they are used together with NumPy and pandas.
  • A Z-score of 2 indicates that the observation lies two standard deviations above the mean, while a value of -2 indicates that it lies two standard deviations below the mean.
  • For any point x in the dataset, the Z-score is calculated as z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the sample. A standard threshold is then set on the Z-score. Values far from zero are unusual, and a Z-score beyond +/-3 is commonly used to identify outliers.
  • When the data follow a parametric distribution in a low-dimensional feature space, the Z-score is a powerful method for removing outliers from a dataset. Methods such as Isolation Forests and DBSCAN can be used where the distribution is non-parametric.
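
The Z-score rule can be illustrated with Python's standard library alone. This is a minimal sketch, not a reference implementation: the function names z_scores and zscore_outliers and the sample data are assumptions, and the population standard deviation (statistics.pstdev) is used here as one possible choice for σ.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / std for every value in the sample."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (a choice)
    if std == 0:
        raise ValueError("all values are identical, so Z-scores are undefined")
    return [(v - mean) / std for v in values]


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Nineteen hypothetical measurements near 10 and one far away.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
          10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.0, 10.1, 9.9, 50.0]

flagged = zscore_outliers(sample)
print(flagged)  # [50.0]
```

Note that in very small samples a single extreme value inflates the standard deviation so much that its own Z-score can never reach 3, so the +/-3 threshold is mainly useful when there are enough observations.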

4. DBSCAN

  • DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise and is a clustering approach.
  • Clustering methods are useful for visualizing and understanding data.
  • DBSCAN can be used to graphically represent the relationships between the features and the trends in the dataset.
  • The density-based clustering algorithm finds neighbouring objects by density within an n-dimensional sphere of radius ‘ɛ’. A cluster in the feature space is a set of points connected through density.
  • DBSCAN assigns each data point to one of three classes: core point, border point, or outlier.
  • A core point is a point that has at least ‘MinPts’ points in its neighbourhood.
  • A border point lies within a cluster and has fewer than ‘MinPts’ points in its neighbourhood, but it is still density-reachable from other points in the cluster.
  • An outlier is a point that does not belong to any cluster and is not density-reachable from any other point.
  • A cluster must satisfy two properties: all its points are mutually density-connected, and any point that is density-reachable from a point of the cluster is also part of the cluster.
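
The three point classes can be made concrete with a small sketch. It covers only the classification step (core, border, outlier), not the full assignment of points to clusters, and it assumes Euclidean distance and that a point counts as a member of its own neighbourhood. The function name dbscan_labels and the sample points are hypothetical.

```python
import math


def dbscan_labels(points, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier' in the DBSCAN sense."""
    # Indices of all points within distance eps of each point (itself included).
    hoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, hood in enumerate(hoods) if len(hood) >= min_pts}

    labels = []
    for i, hood in enumerate(hoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in hood):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("outlier")  # not reachable from any core point
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (10, 10)]
labels = dbscan_labels(points, eps=1.5, min_pts=4)
print(labels)
# ['core', 'core', 'core', 'core', 'border', 'outlier']
```

This brute-force version compares every pair of points, which is fine for a handful of points but slow for large datasets; library implementations typically use spatial indexes instead.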

5. Isolation Forests

  • This method is effective for detecting novelties or outliers.
  • The method is based on binary trees.
  • The basic principle of the Isolation Forest is that outliers are few in number and deviate far from the other observations in the data.
  • The algorithm picks a feature at random and splits it at a random value between that feature’s minimum and maximum. A forest is then built up in the same way over the observations in the set.
  • Predictions are made by comparing each observation against the split values.
  • The ‘path length’ of an instance is the number of splits the algorithm needs to isolate it.
  • An outlier has a shorter path length than the other observations in the dataset.

The approaches for outlier analysis in data mining can also be grouped into statistical methods, supervised methods, and unsupervised methods:

  • Statistical methods include graphing the data, the Z-score, and similar techniques for identifying outliers. When a single outlier is to be detected, the Grubbs test is recommended.
  • Supervised methods use a training set whose labelled instances identify the classes within the data, including the outliers.
  • Unsupervised methods have no labelled instances; predictions are made on the assumption that the majority of instances in the dataset are normal.
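
Returning to Isolation Forests, the path-length idea can be sketched without any external library. This is a simplified illustration, not the full algorithm: it traces only the branch that contains the point of interest instead of building and storing whole trees, it skips the subsampling and score normalisation that a complete implementation would normally use, and the names isolation_depth and mean_path_length, the depth cap, and the sample data are all assumptions.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))          # pick a feature at random
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:                                 # cannot split on this feature
        return depth
    split = rng.uniform(lo, hi)                  # random value between min and max
    # Keep only the rows on the same side of the split as `point`.
    same_side = [
        row for row in data
        if (row[feature] < split) == (point[feature] < split)
    ]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average the isolation depth of `point` over many random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees


data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.2, 1.0),
        (0.8, 0.9), (1.0, 1.2), (1.1, 1.1), (8.0, 8.0)]

outlier_len = mean_path_length((8.0, 8.0), data)
inlier_len = mean_path_length((1.0, 1.0), data)
print(outlier_len < inlier_len)  # True
```

The far-away point is usually cut off by the very first split, while a point inside the dense group needs several splits, which is why a short average path length is treated as a sign of an outlier.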

Conclusion

In data analysis, most people tend to remove outliers because they can lead to wrong predictions. However, there are scenarios, such as fraud detection, where outlier detection plays an important role. Either way, detecting outliers is important for data mining. Several methods for detecting these anomalies have been described in this article.


Frequently Asked Questions (FAQs)

1. Why are outliers caused and what are the ways to handle them?

Outliers are data values that deviate extremely from the rest of a data set. These deviations are distinct from ordinary random error. Significant outliers can produce misleading results in data analysis. They are generally caused by faulty computation or by errors in entering values into the dataset.

There are three simple methods for handling outliers: the univariate method, the multivariate method, and the Minkowski error. The method is chosen according to the underlying cause of the outliers, and sometimes more than one technique must be combined.

Outliers can also be valuable to an analysis, since they can have a significant effect on statistical results. Whether an outlier should be removed therefore depends on its type.

2. What are the key steps to detect all outliers?

The following are the three key steps to detect all outliers in data mining:

1. The first step is to choose the right model and distribution for each time series. This is important because a time series can be stationary, non-stationary, discrete, etc., and the models for each of these types are different.

2. The next step is to identify the seasonal and trend patterns. If these patterns are not found, it is nearly impossible to identify the contextual and collective outliers. For an automated anomaly detection system, the patterns must be identified automatically rather than manually.

3. Determine the relationships between the different time series and patterns, and identify the anomalies accordingly.

3. What is the difference between univariate and multivariate methods?

The univariate method is the simplest way to handle an outlier. Because it involves a single variable, it does not examine any relationships; its main purpose is to analyse the data and determine the pattern in it. Mean, median, and mode are examples of patterns found in univariate data.

The multivariate method, on the other hand, analyses three or more variables. It is more precise than the univariate method because it deals with relationships and patterns among variables. Additive Tree, Canonical Correlation Analysis, and Cluster Analysis are some of the ways to perform multivariate analysis.