Explore Courses
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Birla Institute of Management Technology Birla Institute of Management Technology Post Graduate Diploma in Management (BIMTECH)
  • 24 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Popular
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science & AI (Executive)
  • 12 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
University of MarylandIIIT BangalorePost Graduate Certificate in Data Science & AI (Executive)
  • 8-8.5 Months
upGradupGradData Science Bootcamp with AI
  • 6 months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
OP Jindal Global UniversityOP Jindal Global UniversityMaster of Design in User Experience Design
  • 12 Months
Popular
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Rushford, GenevaRushford Business SchoolDBA Doctorate in Technology (Computer Science)
  • 36 Months
IIIT BangaloreIIIT BangaloreCloud Computing and DevOps Program (Executive)
  • 8 Months
New
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Popular
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
Golden Gate University Golden Gate University Doctor of Business Administration in Digital Leadership
  • 36 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
Popular
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
Bestseller
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
IIIT BangaloreIIIT BangalorePost Graduate Certificate in Machine Learning & Deep Learning (Executive)
  • 8 Months
Bestseller
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in AI and Emerging Technologies (Blended Learning Program)
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
ESGCI, ParisESGCI, ParisDoctorate of Business Administration (DBA) from ESGCI, Paris
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration From Golden Gate University, San Francisco
  • 36 Months
Rushford Business SchoolRushford Business SchoolDoctor of Business Administration from Rushford Business School, Switzerland)
  • 36 Months
Edgewood CollegeEdgewood CollegeDoctorate of Business Administration from Edgewood College
  • 24 Months
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with Concentration in Generative AI
  • 36 Months
Golden Gate University Golden Gate University DBA in Digital Leadership from Golden Gate University, San Francisco
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Deakin Business School and Institute of Management Technology, GhaziabadDeakin Business School and IMT, GhaziabadMBA (Master of Business Administration)
  • 12 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science (Executive)
  • 12 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityO.P.Jindal Global University
  • 12 Months
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (AI/ML)
  • 36 Months
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDBA Specialisation in AI & ML
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
New
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGrad KnowledgeHutupGrad KnowledgeHutAzure Administrator Certification (AZ-104)
  • 24 Hours
KnowledgeHut upGradKnowledgeHut upGradAWS Cloud Practioner Essentials Certification
  • 1 Week
KnowledgeHut upGradKnowledgeHut upGradAzure Data Engineering Training (DP-203)
  • 1 Week
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
Loyola Institute of Business Administration (LIBA)Loyola Institute of Business Administration (LIBA)Executive PG Programme in Human Resource Management
  • 11 Months
Popular
Goa Institute of ManagementGoa Institute of ManagementExecutive PG Program in Healthcare Management
  • 11 Months
IMT GhaziabadIMT GhaziabadAdvanced General Management Program
  • 11 Months
Golden Gate UniversityGolden Gate UniversityProfessional Certificate in Global Business Management
  • 6-8 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
IU, GermanyIU, GermanyMaster of Business Administration (90 ECTS)
  • 18 Months
Bestseller
IU, GermanyIU, GermanyMaster in International Management (120 ECTS)
  • 24 Months
Popular
IU, GermanyIU, GermanyB.Sc. Computer Science (180 ECTS)
  • 36 Months
Clark UniversityClark UniversityMaster of Business Administration
  • 23 Months
New
Golden Gate UniversityGolden Gate UniversityMaster of Business Administration
  • 20 Months
Clark University, USClark University, USMS in Project Management
  • 20 Months
New
Edgewood CollegeEdgewood CollegeMaster of Business Administration
  • 23 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
KnowledgeHut upGradKnowledgeHut upGradBackend Development Bootcamp
  • Self-Paced
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 5 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
upGradupGradUI/UX Bootcamp
  • 3 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
upGradupGradDigital Marketing Accelerator Program
  • 05 Months

Dataframe in Apache PySpark: Comprehensive Tutorial [with Examples]

Updated on 30 November, 2022

6.29K+ views
12 min read

Today, we are going to learn about the DataFrame in Apache PySpark. Pyspark is one of the top data science tools in 2020. It is named columns of a distributed collection of rows in Apache Spark. It is very similar to the Tables or columns in Excel Sheets and also similar to the relational database’ table. PySpark DataFrame also has similar characteristics of RDD, which are:

Distributed: The nature of DataFrame and RDD is both distributed

Lazy Evaluations: Execution of task is not done if the action is not performed

Nature of Immutable: Another similar characteristic of RDD / DataFrame is that it cannot be changed once it is created. But one can apply the transformation to transform the RDD / DataFrame.

Learn data science to gain edge over your competitors

Advantage of DataFrames

1. This supports many various languages, such as Java, Scala, R, Python, which is useful in terms of API Support. The API support for multiple languages helps in working with many programming languages.

2. A wide range of data sources and formats are supported by DataFrame, which helps a lot to use a different source of data and their format conveniently.

3. One of the best parts about DataFrame is that it can even handle Petabytes of data, which is a remarkable ability to handle such massive data.

4. Apache Spark quickly understands the schema of DataFrame with the observation in Spark DataFrame. Under named columns, the Spark DataFrame’s Observation is organized. In this way, the plan of queries execution is optimized.

5. Massive Volume of Semi-structured and Structured Data can be quickly processed because it is designed to do it DataFrames.

Apache Spark Setup

Apache Spark should be set up in the machine before it can be started to use for DataFrame Operations. Data can be operated with the support of DataFrame as it supports various DataFrame Operations. Here we are going to discuss some common DataFrame’ operations.

The creation of SparkContext is the first step in the programming of Apache. For the execution of operations in a cluster, there is a requirement of SparkContext. How to access a cluster is told by SparkContext. It also shows the Spark about the location to obtain a cluster.

Read: Deep Learning Frameworks

Then the Apache Cluster connection is established. Its creation is already done if one is using Spark Shell. The other way, configuration setting, can be provided, initialized, and imported for creation of the SparkContext.

One can use this code for the creation:

from pyspark import SparkContext

sc = SparkContext()

DataFrame Creation from CSV file

A new library has to be specified in the shell of python so that the CSV file can be read. To do this, the first step is to download the latest version of the Spark-CSV package and do the extraction of the package in the Spark’s Home Directory. After that, we will open the shell of PySpark, and the package has to be included.

$ ./bin/pyspark –packages com.databricks:spark-csv_2.10:1.3.0

Now the DataFrame will be created after the data has been read from the CSV file.

train = sqlContext.load(source=”com.databricks.spark.csv”, path = ‘PATH/train.csv’, header = True,inferSchema = True)

test = sqlContext.load(source=”com.databricks.spark.csv”, path = ‘PATH/test-comb.csv’, header = True,inferSchema = True)

The test CSV files and train CSV files are located in the folder location called PATH. If the Header is there in the file of CSV, then it will show as True. To know the type of data in each column of the data frame, we will use inferSchema = True option. By using inferSchema = True option, Detection of the data type of data frame’s each column will be detected automatically by SQL context. All the columns will be read as a string If we do not set inferSchema to be true.

Read: Python Libraries for Machine Learning

Manipulation of DataFrame 

Now here we are going to see how to manipulate the Data Frame:

  • Know Columns’ Type of Data

printSchema will be used to see the column type and its data type. Now the schema will be printed in tree format by applying the printSchema().

train.printSchema()

Output:

root

 |– User_ID: integer (nullable = true)

 |– Product_ID: string (nullable = true)

 |– Gender: string (nullable = true)

 |– Age: string (nullable = true)

 |– Occupation: integer (nullable = true)

 |– City_Category: string (nullable = true)

 |– Stay_In_Current_City_Years: string (nullable = true)

 |– Marital_Status: integer (nullable = true)

 |– Product_Category_1: integer (nullable = true)

 |– Product_Category_2: integer (nullable = true)

 |– Product_Category_3: integer (nullable = true)

 |– Purchase: integer (nullable = true)

After the reading of file of csv, we can see that we accurately got the type of data or the schema of each column in data frame.

  • Showing first n observation

To see the first n observation, one can use the head operation. Pandas’s head operation is same like that of PySpark’s head operation.

train.head(5)

Output:

[Row(User_ID=1000001, Product_ID=u’P00069042′, Gender=u’F’, Age=u’0-17′, Occupation=10, City_Category=u’A’, Stay_In_Current_City_Years=u’2′, Marital_Status=0, Product_Category_1=3, Product_Category_2=None, Product_Category_3=None, Purchase=8370),

 Row(User_ID=1000001, Product_ID=u’P00248942′, Gender=u’F’, Age=u’0-17′, Occupation=10, City_Category=u’A’, Stay_In_Current_City_Years=u’2′, Marital_Status=0, Product_Category_1=1, Product_Category_2=6, Product_Category_3=14, Purchase=15200),

 Row(User_ID=1000001, Product_ID=u’P00087842′, Gender=u’F’, Age=u’0-17′, Occupation=10, City_Category=u’A’, Stay_In_Current_City_Years=u’2′, Marital_Status=0, Product_Category_1=12, Product_Category_2=None, Product_Category_3=None, Purchase=1422),

 Row(User_ID=1000001, Product_ID=u’P00085442′, Gender=u’F’, Age=u’0-17′, Occupation=10, City_Category=u’A’, Stay_In_Current_City_Years=u’2′, Marital_Status=0, Product_Category_1=12, Product_Category_2=14, Product_Category_3=None, Purchase=1057),

 Row(User_ID=1000002, Product_ID=u’P00285442′, Gender=u’M’, Age=u’55+’, Occupation=16, City_Category=u’C’, Stay_In_Current_City_Years=u’4+’, Marital_Status=0, Product_Category_1=8, Product_Category_2=None, Product_Category_3=None, Purchase=7969)]

Now we will use the show operation to see the result in a better manner because the results will come in the format of row. We can also truncate the result by using the argument truncate = True.

train.show(2,truncate= True)

Output:

+——-+———-+——+—-+———-+————-+————————–+————–+——————+——————+——————+——–+

|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|

+——-+———-+——+—-+———-+————-+————————–+————–+——————+——————+——————+——–+

|1000001| P00069042| F|0-17| 10| A| 2| 0| 3| null| null| 8370|

|1000001| P00248942| F|0-17| 10| A| 2| 0| 1| 6| 14| 15200|

+——-+———-+——+—-+———-+————-+————————–+————–+——————+——————+——————+——–+

only showing top 2 rows

upGrad’s Exclusive Data Science Webinar for you –

Watch our Webinar on How to Build Digital & Data Mindset?

  • Counting of DataFrame’s rows’ number

To count the rows number in the data frame, we can use the operation of the count. Now we will count the number of rows of test files and train files by applying the count operation. 

train.count(),test.count()

Output:

(233598, 550069)

We have 233598, 550069, rows in test & train, respectively.

  • Getting the count of column and name of columns from test and train file

Similar to the operation of the column in DataFrame of pandas, we will use columns operation to get the name of the column. Now first we will print the no. of the column and the name of the column from the test file and train file.

len(train.columns), train.columns

OutPut: 

12 [‘User_ID’, ‘Product_ID’, ‘Gender’, ‘Age’, ‘Occupation’, ‘City_Category’, ‘Stay_In_Current_City_Years’, ‘Marital_Status’, ‘Product_Category_1’, ‘Product_Category_2’, ‘Product_Category_3’, ‘Purchase’]

Now we are doing it for the test file similarly.

len(test.columns), test.columns

Output:

13 [”, ‘User_ID’, ‘Product_ID’, ‘Gender’, ‘Age’, ‘Occupation’, ‘City_Category’, ‘Stay_In_Current_City_Years’, ‘Marital_Status’, ‘Product_Category_1’, ‘Product_Category_2’, ‘Product_Category_3’, ‘Comb’]

After the output above, we can see that there are 12 columns in the training file and 13 columns in the test file. From the above output, we can check that we have 13 columns in the test file and 12 in the training file. Column “Comb” is the only single column in the test file, and there is no “Purchase” not present in the test file. There is one more column in the test file that we can see is not having any name of the column.

  • Getting the summary statistics such as count, max, min, standard deviance, mean in the DataFrame’s numerical columns

In the DataFrame, we will use the operation called describe the operation. We can do the calculation of the numerical column and get a statistical summary by using describe the operation. All the numerical columns will be calculated in the DataFrame, we there is no column name specified in the calculation of summary statistics.

train.describe().show()

Output:

+——-+——————+—————–+——————-+——————+——————+——————+——————+

|summary| User_ID| Occupation| Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3| Purchase|

+——-+——————+—————–+——————-+——————+——————+——————+——————+

| count| 550068| 550068| 550068| 550068| 376430| 166821| 550068|

| mean|1003028.8424013031|8.076706879876669|0.40965298835780306| 5.404270017525106| 9.842329251122386|12.668243206790512| 9263.968712959126|

| stddev|1727.5915855308265|6.522660487341778| 0.4917701263173273|3.9362113692014082| 5.086589648693526| 4.125337631575267|5023.0653938206015|

| min| 1000001| 0| 0| 1| 2| 3| 12|

| max| 1006040| 20| 1| 20| 18| 18| 23961|

+——-+——————+—————–+——————-+——————+——————+——————+——————+

In describe operation, this is what we get when string column name or categorical column name is specified.

train.describe(‘Product_ID’).show()

Output:

+——-+———-+

|summary|Product_ID|

+——-+———-+

| count| 550068|

| mean| null|

| stddev| null|

| min| P00000142|

| max| P0099942|

+——-+———-+

Based on ASCII, the max and min values of calculated. Describe operation is used to work on the String type column. 

  • Selection of DataFrame’s column

We will use the name of the columns in the select operation to select the column. We will mention the name of the column with the separation by using commas. Now we are going to see how the selection of “Age” and “User_ID” from the training file is made.

  • train.select(‘User_ID’,’Age’).show(5)
  •  Output:
  • +——-+—-+
  • |User_ID| Age|
  • +——-+—-+
  • |1000001|0-17|
  • |1000001|0-17|
  • |1000001|0-17|
  • |1000001|0-17|
  • |1000002| 55+|
  • +——-+—-+
  • Finding Distinct product no. in test files and train files

To calculate the DataFrame’s no. of distinct rows, we will use the distinct operation. Now here we are going to apply distinct operation for the calculation of no. of distinct product in test and train file.

train.select(‘Product_ID’).distinct().count(),test.select(‘Product_ID’).distinct().count()

Output:

(3633, 3492)

We have 3492 & 3633 distinct products in test & train file, respectively. Now we know that in the training file, we have more distinct values than the test file as we can learn from the output result. Now we will use subtract operation to find out the Product_ID categories which are not present in the training file but is present in the test file. Same thing one can also do for all features of categorical.

diff_cat_in_train_test=test.select(‘Product_ID’).subtract(train.select(‘Product_ID’))

diff_cat_in_train_test.distinct().count()# For distinct count

Output:

46

So from the above result, we can know that there are 47 various categories, which are not present in the training file but is present in the test file. The data will be skipped or collected from the test file, which is not present in the file of the train.

  • Calculation of categorical columns’ pairwise frequency?

Let us do the calculation of the column’s pairwise frequency in the DataFrame by using the operation can crosstab operation. Now let us calculate the “Gender” and “Age” columns in DataFrame of the train by applying crosstab operation.

train.crosstab(‘Age’, ‘Gender’).show()

Output:

+———-+—–+——+

|Age_Gender| F| M|

+———-+—–+——+

| 0-17| 5083| 10019|

| 46-50|13199| 32502|

| 18-25|24628| 75032|

| 36-45|27170| 82843|

| 55+| 5083| 16421|

| 51-55| 9894| 28607|

| 26-35|50752|168835|

+———-+—–+——+

The distinct value of Gender is the column name, and the different amount of Age is row name, which can be seen in the above result. In the table, the count of the pair will be zero if it has not occurred.

Our learners also read: Free Python Course with Certification

How to get DataFrame with Unique rows?

To find unique rows and not to include duplicate rows, we will use dropDuplicates operation. It will get the Dataframe without any duplicate rows by dropping the duplicate rows of a DataFrame. Please check here to know how the dropDuplicates procedure is performed to get all the unique rows for the columns.

train.select(‘Age’,’Gender’).dropDuplicates().show()

Output:

+—–+——+

| Age|Gender|

+—–+——+

|51-55| F|

|51-55| M|

|26-35| F|

|26-35| M|

|36-45| F|

|36-45| M|

|46-50| F|

|46-50| M|

| 55+| F|

| 55+| M|

|18-25| F|

| 0-17| F|

|18-25| M|

| 0-17| M|

+—–+——+

  • How to drop rows will null value?

If one wants to drop all the rows which have a null value, then we can use the operation called dropna operation. To drop row from the DataFrame, it considers three options.

  • Subset – it is the list of all the names of the columns to considered for the operation of dropping column with null values.
  • Thresh – this helps in dropping the rows with less than thresh non-null values. By default, nothing is specified in this.
  • How – It can be used in two types – all or any. By using any, it will drop the row if any value in the row is null. By using all, it will decrease the row if all the rows’ values are null.

Now here we are going to use all these options one by one to drop the rows which are null by using default options such as subset, thresh, None for how, none, any.

train.dropna().count()

Output:

166821

  • How to fill the DataFrame’s null values with constant no.?

To fill the null values with constant no. We will use fillna operation here. There are two parameters to be considered by fillna operation to fill the null values.

  • subset: Here, one needs to specify the columns to be considered for filling values.
  • value: Here, we can mention the amount to be replaced with what value, which can be any data type such as string, float, int in all the columns.

Here we are going to fill ‘-1’ in place of null values in train DataFrame.

train.fillna(-1).show(2)

Output:

+——-+———-+——+—-+———-+————-+————————–+————–+——————+——————+——————+——–+

|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|

+——-+———-+——+—-+———-+————-+————————–+————–+——————+——————+——————+——–+

|1000001| P00069042| F|0-17| 10| A| 2| 0| 3| -1| -1| 8370|

|1000001| P00248942| F|0-17| 10| A| 2| 0| 1| 6| 14| 15200|

+——-+———-+——+—-+———-+————-+————————–+————–+——————+——————+——————+——–+

only showing top 2 rows

Conclusion

PySpark is gaining momentum in the world of Artificial Intelligence and Machine learning. PySpark is used to solve real-world machine learning problem. You can create RDD from different data source both external and existing and do all types of transforms on it. Hope this article has been informative and was able to give you deep insights on PySpark dataframes.

If you are curious to learn about PySpark, and other data science tools, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.

Frequently Asked Questions (FAQs)

1. Is PySpark more efficient than Pandas?

Yes, PySpark is quicker than Pandas, and it even outperforms Pandas in a benchmarking test. In basic terms, Pandas does operations on a single machine, whereas PySpark executes operations across several machines. If you're working on a Machine Learning application with a huge dataset, PySpark is the ideal option, as it can execute operations 100 times quicker than Pandas. Because of the JVM, the Scala programming language is 10 times quicker than Python for data analysis and processing. When Python programming code is utilized to make calls to Spark libraries, the results are mediocre.

2. What are some of the drawbacks of using Apache PySpark?

Spark doesn't have its own file management system. Due to the high cost of extra memory required to operate Spark, in-memory computing might be prohibitively costly. When utilising Apache Spark with Hadoop, developers run into difficulties with compact files. Data iterates in batches in Spark, with each iteration being planned and processed independently. In Apache Spark, data is divided into smaller batches at a predetermined time interval. As a result, record-based window criteria will not be supported by Apache. It instead provides time-based window criteria.

3. How are Datasets, DataFrame and RDD different from each other?

RDD is a clustered collection of data items that is dispersed over multiple computers. Data is represented via RDDs, which are a collection of Java or Scala objects. A DataFrame is a collection of data structured into named columns that is spread across many servers. In a relational database, it is conceptually equivalent to a table. Dataset is a dataframe API extension that offers the RDD API's type-safe, object-oriented programming interface capability. A DataFrame is a distributed collection of data, similar to an RDD, and is not mutable. Data is structured into named columns, similar to a table in a relational database, rather than an RDD. When it comes to simple tasks like grouping data, RDD is slower than both Dataframes and Datasets. It has a simple API for performing aggregate tasks. It can aggregate data quicker than RDDs and Datasets.