Top 6 Techniques Used in Feature Engineering [Machine Learning]

Updated on 27 September, 2022


Introduction

Feature engineering is one of the most important aspects of any data science project. It refers to the techniques used for extracting and refining features from raw data, with the aim of creating suitable input data for the model and improving its performance.

Models are trained on the features that we derive from the raw data. The data we have may not be good enough, in its raw form, for the model to learn from. If we can derive features that capture the underlying problem, we obtain a good representation of the data. The better the representation of the data, the better the model fits and the better its results.

The workflow of any data science project is an iterative process rather than a one-time process. In most data science projects, a base model is created after creating and refining the features from the raw data. Upon obtaining the results of the base model, some existing features can be tweaked, and some new features are also derived from the data to optimize the model results.

Feature Engineering

The techniques used in the feature engineering process may not produce the same results for all algorithms and data sets. Some of the common techniques are as follows:

1. Value Transformation

The values of a feature can be transformed onto another scale by using functions such as the logarithm, the root, the exponential, etc. These functions have limitations and cannot be used for all types of data. For instance, the square root transformation cannot be applied to features that contain negative values, and the logarithmic transformation cannot be applied to features that contain zero or negative values.

One of the most commonly used functions is the logarithm. The log transformation can reduce the skewness of data that has a long tail towards large values. It tends to bring the distribution closer to normal, which reduces the effect of outliers on the performance of the model.

It also helps in reducing the magnitude of the values in a feature. This is useful when we are using some algorithms which consider the features with greater values to be of greater importance than the others.
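
As a rough illustration of the log transformation, the sketch below applies it to a made-up, heavily right-skewed feature. The column name `amount` and its values are assumptions chosen for the example, and `np.log1p` (which computes log(1 + x)) is used so that zeros would not break the transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature; the values are illustrative only.
df = pd.DataFrame({"amount": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]})

# log1p computes log(1 + x), so a zero would map to 0 instead of -inf.
# Negative values below -1 would still be invalid input.
df["amount_log"] = np.log1p(df["amount"])

# Compare the sample skewness before and after the transformation.
skew_before = df["amount"].skew()
skew_after = df["amount_log"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# The transformed feature also spans a much smaller range of magnitudes.
print(df["amount"].max(), round(df["amount_log"].max(), 2))
```

The transformed column keeps the ordering of the original values while compressing the large ones, which is why it suits magnitude-sensitive algorithms.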

2. Data Imputation

Data imputation refers to filling in the missing values in a data set with some statistical value. This technique is important because some algorithms cannot handle missing values, which forces us either to use other algorithms or to impute the missing values. Imputation is preferred when the percentage of missing values in a feature is small (around 5 to 10%); otherwise it distorts the distribution of the data. There are different methods for numerical and categorical features.

We can impute the missing values in numerical features with arbitrary values within a specified range or with statistical measures like the mean, median, etc. These imputations must be made carefully, as some statistical measures (the mean in particular) are sensitive to outliers, which can degrade the performance of the model. For categorical features, we can impute the missing values with a category that is known from the domain but absent from the data set, or simply label them as "Missing" if the category is unknown.

The former requires a good sense of domain knowledge to be able to find the correct category while the latter is more of an alternative for generalization. We can also use mode to impute the categorical features. Imputing the data with mode might also lead to over-representation of the most frequent label if the missing values are too high in number.
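
The imputation options described above can be sketched with pandas as follows. The data frame, its column names and the "Missing" label are assumptions made for the example, not part of any particular data set.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values; 120 acts as an outlier in "age".
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 120.0],
    "city": ["Pune", None, "Delhi", "Pune", None],
})

# Numerical feature: the median is pulled far less by the outlier than the mean.
age_mean = df["age"].mean()
age_median = df["age"].median()
df["age_imputed"] = df["age"].fillna(age_median)

# Categorical feature, option 1: an explicit "Missing" category.
df["city_missing"] = df["city"].fillna("Missing")

# Categorical feature, option 2: the most frequent label (mode).
city_mode = df["city"].mode()[0]
df["city_mode"] = df["city"].fillna(city_mode)

print(df)
```

In this toy example 2 of the 5 `city` values are missing, far above the 5 to 10% guideline, so mode imputation visibly inflates the share of the most frequent label.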


3. Categorical Encoding

One of the requirements in many algorithms is that the input data should be numerical in nature. This turns out to be a constraint for using categorical features in such algorithms. To represent the categorical features as numbers, we need to perform categorical encoding. Some of the methods to convert the categorical features into numbers are as follows:

1. One-hot encoding: One-hot encoding creates a new feature for each label in a categorical feature, taking a value of either 0 or 1. Each new feature indicates whether that label is present for a given observation. For instance, if a categorical feature has 4 labels, one-hot encoding creates 4 Boolean features.

The same information can also be represented with 3 features: if all 3 contain 0, the value of the categorical feature must be the 4th label. This method increases the feature space considerably if the data set has many categorical features with a large number of labels.

2. Frequency encoding: This method calculates the count or the percentage of each label in the categorical feature and maps it to that label. It does not extend the feature space of the data set. One drawback is that if two or more labels have the same count in the data set, they are all mapped to the same number, so the model can no longer tell them apart. This can lead to the loss of crucial information.

3. Ordinal encoding: Also known as label encoding, this method maps each distinct value of a categorical feature to a number from 0 to n-1, where n is the number of distinct labels in the feature. This method does not enlarge the feature space of the data set, but it does impose an ordinal relationship on the labels, which may not exist in the data.
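
The three encodings can be sketched with pandas as shown below. The `color` column and its labels are assumptions made for the example, and the alphabetical ordering used for the ordinal mapping is an arbitrary choice.

```python
import pandas as pd

# Hypothetical categorical feature with three labels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "green"]})

# One-hot encoding: one 0/1 column per label.
# Passing drop_first=True would keep n-1 columns instead of n.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Frequency encoding: map each label to its share of the observations.
freq = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freq)

# Ordinal (label) encoding: map each label to an integer from 0 to n-1.
# The alphabetical order used here is arbitrary and implies no real ranking.
mapping = {label: i for i, label in enumerate(sorted(df["color"].unique()))}
df["color_ordinal"] = df["color"].map(mapping)

print(one_hot)
print(df)
```

Only the one-hot version adds columns; the frequency and ordinal versions each replace the feature with a single numeric column.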

4. Handling of Outliers

Outliers are data points whose values are very different from the rest of the data. To handle outliers, we first need to detect them. We can detect them using visualizations like box plots and scatter plots in Python, or we can use the interquartile range (IQR). The interquartile range is the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile).

The values that fall outside the range from (Q1 - 1.5*IQR) to (Q3 + 1.5*IQR) are termed outliers. After detecting the outliers, we can handle them by removing them from the data set, applying some transformation, treating them as missing values and imputing them, etc.
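
A minimal NumPy sketch of the IQR rule follows. The sample values are made up, and capping at the fences is shown only as one possible transformation, alongside simple removal.

```python
import numpy as np

# Hypothetical feature values; 102 is an obvious outlier.
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# First and third quartiles, and the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# Option 1: drop the outliers.
cleaned = values[~is_outlier]

# Option 2: cap the values at the fences instead of dropping them.
capped = np.clip(values, lower, upper)

print("fences:", lower, upper)
print("outliers:", outliers)
```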

5. Feature Scaling

Feature scaling is used to change the values of the features and bring them within a common range. It is important to apply this process if we are using algorithms such as SVM, KNN, or linear regression (when trained with regularization or gradient descent) that are sensitive to the magnitude of the values. To scale the features, we can perform normalization, min-max scaling, or standardization. Normalization (mean normalization) rescales the values of a feature to lie between -1 and 1. It is the ratio of the difference between each observation and the mean to the difference between the maximum and minimum values of that feature, i.e. [X - mean(X)]/[max(X) - min(X)].

Min-max scaling uses the minimum value of the feature instead of the mean, i.e. [X - min(X)]/[max(X) - min(X)], which rescales the values of a feature from 0 to 1. This method is very sensitive to outliers as it only considers the end values of the feature. Standardization subtracts the mean and divides by the standard deviation, i.e. [X - mean(X)]/std(X), so that the feature has a mean of 0 and a standard deviation of 1 without being bound to a fixed range. None of these methods changes the shape of the distribution; they only shift and rescale the values.
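
The scaling methods can be written out directly in NumPy, as in the sketch below. The array `x` is a made-up feature. Standardization is implemented here as subtracting the mean and dividing by the standard deviation, which gives a mean of 0 and a standard deviation of 1 rather than a fixed range.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

value_range = x.max() - x.min()

# Mean normalization: centred on 0, values stay within [-1, 1].
mean_normalized = (x - x.mean()) / value_range

# Min-max scaling: values fall in [0, 1].
min_max_scaled = (x - x.min()) / value_range

# Standardization: mean 0 and standard deviation 1, no fixed range.
standardized = (x - x.mean()) / x.std()

print(mean_normalized)
print(min_max_scaled)
print(standardized)
```

All three are linear rescalings of the same values, so the relative spacing between observations is unchanged.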

6. Handling Date and Time Variables

We come across many variables that indicate the date and time in different formats. We can derive more features from the date like the month, day of the week/month, year, weekend or not, the difference between the dates, etc. This can allow us to extract more insightful information from the data set. From the time features, we can also extract information like hours, minutes, seconds, etc.
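
As an illustration, the date and time parts listed above can be pulled out with the pandas `.dt` accessor. The column names `order_ts` and `ship_ts` and the timestamps are assumptions made for the example.

```python
import pandas as pd

# Hypothetical order and shipping timestamps.
df = pd.DataFrame({
    "order_ts": pd.to_datetime(["2022-09-24 08:15:00", "2022-09-27 17:40:30"]),
    "ship_ts": pd.to_datetime(["2022-09-26 10:00:00", "2022-10-01 09:00:00"]),
})

# Calendar parts of the order date.
df["year"] = df["order_ts"].dt.year
df["month"] = df["order_ts"].dt.month
df["day_of_month"] = df["order_ts"].dt.day
df["day_of_week"] = df["order_ts"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5

# Time parts.
df["hour"] = df["order_ts"].dt.hour
df["minute"] = df["order_ts"].dt.minute

# Difference between two dates, in whole days.
df["days_to_ship"] = (df["ship_ts"] - df["order_ts"]).dt.days

print(df)
```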

One thing that most people miss is that many date and time variables, such as the day of the week, the month, or the hour, are cyclic features. For example, suppose we need to check which day between Wednesday (4) and Saturday (7) is closer to Sunday (1). We know that Saturday is closer, but in numerical terms it will be Wednesday, as the distance between 4 and 1 is less than that between 7 and 1. The same applies when the time is in 24-hour format: 23:00 and 00:00 are one hour apart, but numerically they differ by 23.


To tackle this problem, we can represent these variables with the sine and cosine functions. For the 'minute' feature, we can apply the sin and cos functions using NumPy to represent its cyclic nature as follows:

import numpy as np

minute_feature_sin = np.sin(df['minute_feature'] * (2 * np.pi / 60))

minute_feature_cos = np.cos(df['minute_feature'] * (2 * np.pi / 60))

(Note: Dividing by 60 because there are 60 minutes in an hour. If you want to do it for months, divide it by 12 and so on)

If you plot these two features against each other on a scatter plot, the points lie on a circle, which shows the cyclic relationship between them.
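
The same idea applied to the day-of-week example can be sketched as follows. The helper `cyclic_encode` is a hypothetical name introduced for this example, and the numbering assumes Sunday = 1 through Saturday = 7, which makes Wednesday = 4.

```python
import numpy as np

def cyclic_encode(value, period):
    """Map a cyclic value to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Assumed numbering: Sunday = 1, ..., Wednesday = 4, ..., Saturday = 7.
sunday = cyclic_encode(1, 7)
wednesday = cyclic_encode(4, 7)
saturday = cyclic_encode(7, 7)

# Distances on the raw numeric scale suggest Wednesday is closer to Sunday.
raw_wed, raw_sat = abs(4 - 1), abs(7 - 1)

# Distances in (sin, cos) space reflect the true cyclic order.
dist_wed = np.linalg.norm(wednesday - sunday)
dist_sat = np.linalg.norm(saturday - sunday)

print("raw distances:", raw_wed, raw_sat)
print("cyclic distances:", round(dist_wed, 3), round(dist_sat, 3))
```

After the encoding Saturday is closer to Sunday than Wednesday is, and every encoded point lies on the unit circle.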

Conclusion

This article focused on the importance of feature engineering and described some common techniques used in the process. Which of the techniques listed above provides better insights depends on the algorithm and the data at hand.

That is hard to judge in advance and cannot safely be assumed, as data sets differ and the algorithms applied to them vary as well. The better approach is to work incrementally and keep track of the models that have been built along with their results, rather than performing feature engineering recklessly.


Frequently Asked Questions (FAQs)

1. What are the cons of using the mean- or median-based data imputation technique?

With mean imputation, the relationships and correlations between variables are not preserved. Imputing the mean does preserve the mean of the observed data, so if the data are missing completely at random, the mean estimate remains unbiased. However, mean imputation reduces the variance of the imputed variable. It therefore shrinks standard errors, making most hypothesis tests and confidence interval calculations inaccurate. As a result, type I errors become more likely without the analyst noticing.
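
The effect on the mean and the variance can be checked with a small simulation. The distribution, the sample size and the 30% missing rate below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete feature: 1000 draws from a normal distribution.
full = rng.normal(loc=50.0, scale=10.0, size=1000)

# Remove 30% of the values completely at random.
observed = full.copy()
missing_idx = rng.choice(full.size, size=300, replace=False)
observed[missing_idx] = np.nan

# Mean imputation: fill every gap with the mean of the observed values.
observed_mean = np.nanmean(observed)
imputed = np.where(np.isnan(observed), observed_mean, observed)

# The mean is preserved, but the sample variance shrinks.
var_observed = np.nanvar(observed, ddof=1)
var_imputed = imputed.var(ddof=1)

print("mean:", round(observed_mean, 3), round(imputed.mean(), 3))
print("variance:", round(var_observed, 3), round(var_imputed, 3))
```

The imputed values add nothing to the sum of squared deviations while increasing the sample size, which is why the variance estimate drops.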

2. Why is feature extraction required?

Feature extraction is used to find the smallest and most informative set of features (distinct patterns) in order to improve the classifier's effectiveness. It helps remove unnecessary data from a data set so that emphasis is placed only on the relevant information and features. Reducing the data also lets the machine build the model with less effort and speeds up the learning and generalization steps of the machine learning process. One notable application is biomedical signal classification, where feature extraction is an important element: if the features aren't chosen carefully, classification performance can suffer.

3. Are there any cons of using the feature extraction technique?

Feature extraction often produces new features that cannot easily be read or understood by non-specialists. Scalability is another challenge. If the data sets are large, some feature extraction techniques cannot be executed; complex non-linear approaches, in particular, may be infeasible. Most techniques rely on some form of approximation to handle the feature selection problem efficiently, and in certain situations that approximation fails to solve the exact problem.