Output:
12 ['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Purchase']
Now let's do the same for the test file.
len(test.columns), test.columns
Output:
13 ['', 'User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Comb']
From the output above, we can see that the training file has 12 columns and the test file has 13. "Purchase" is present only in the training file, while "Comb" appears only in the test file. The test file also contains one extra column with an empty name.
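As a quick sketch, the same comparison can be done programmatically with Python set operations on the column lists (assuming the train and test DataFrames loaded earlier):

# Columns present in one file but not the other
print(set(train.columns) - set(test.columns))   # expected from the output above: {'Purchase'}
print(set(test.columns) - set(train.columns))   # expected from the output above: {'', 'Comb'}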
- Getting summary statistics such as count, max, min, standard deviation, and mean for the DataFrame's numerical columns
We will use the describe operation on the DataFrame. The describe operation computes a statistical summary of the numerical columns. When no column name is specified, all numerical columns in the DataFrame are included in the calculation.
train.describe().show()
Output:
+-------+------------------+-----------------+-------------------+------------------+------------------+------------------+------------------+
|summary|           User_ID|       Occupation|     Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|          Purchase|
+-------+------------------+-----------------+-------------------+------------------+------------------+------------------+------------------+
|  count|            550068|           550068|             550068|            550068|            376430|            166821|            550068|
|   mean|1003028.8424013031|8.076706879876669|0.40965298835780306| 5.404270017525106| 9.842329251122386|12.668243206790512| 9263.968712959126|
| stddev|1727.5915855308265|6.522660487341778| 0.4917701263173273|3.9362113692014082| 5.086589648693526| 4.125337631575267|5023.0653938206015|
|    min|           1000001|                0|                  0|                 1|                 2|                 3|                12|
|    max|           1006040|               20|                  1|                20|                18|                18|             23961|
+-------+------------------+-----------------+-------------------+------------------+------------------+------------------+------------------+
Here is what the describe operation returns when a String (categorical) column name is specified.
train.describe('Product_ID').show()
Output:
+-------+----------+
|summary|Product_ID|
+-------+----------+
|  count|    550068|
|   mean|      null|
| stddev|      null|
|    min| P00000142|
|    max|  P0099942|
+-------+----------+
The describe operation also works on String-type columns: the min and max values are calculated based on ASCII ordering, while mean and stddev are null.
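As an aside, a comparable summary for a single string column could also be sketched with the standard pyspark.sql.functions aggregates, which makes the ASCII-based ordering explicit:

from pyspark.sql import functions as F

# min/max on a string column use lexicographic (ASCII) ordering,
# which is why describe can report them while mean/stddev stay null
train.agg(
    F.count('Product_ID').alias('count'),
    F.min('Product_ID').alias('min'),
    F.max('Product_ID').alias('max')
).show()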
- Selecting columns from the DataFrame
To select columns, we pass their names to the select operation, separated by commas. Let us see how to select the "User_ID" and "Age" columns from the training file.
train.select('User_ID','Age').show(5)
Output:
+-------+----+
|User_ID| Age|
+-------+----+
|1000001|0-17|
|1000001|0-17|
|1000001|0-17|
|1000001|0-17|
|1000002| 55+|
+-------+----+
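The select operation also accepts Column objects, so the same selection could be written in a couple of equivalent ways (a minimal sketch using the standard column accessors):

from pyspark.sql import functions as F

# Equivalent ways of selecting the same two columns
train.select(train['User_ID'], train['Age']).show(5)
train.select(F.col('User_ID'), F.col('Age')).show(5)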
- Finding the number of distinct products in the train and test files
To count a DataFrame's distinct rows, we use the distinct operation. Here we apply it to count the number of distinct products in the train and test files.
train.select('Product_ID').distinct().count(), test.select('Product_ID').distinct().count()
Output:
(3633, 3492)
We have 3633 distinct products in the training file and 3492 in the test file, so the training file contains more distinct values than the test file. Now we will use the subtract operation to find the Product_ID categories that are present in the test file but not in the training file. The same can be done for any categorical feature.
diff_cat_in_train_test = test.select('Product_ID').subtract(train.select('Product_ID'))
diff_cat_in_train_test.distinct().count()  # For distinct count
Output:
46
So from the result above, we know that there are 46 categories that are present in the test file but not in the training file. Rows in the test file with these unseen Product_IDs will have to be either skipped or handled separately, since the model never encounters them during training.
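As a hedged sketch of one way to handle this, the test rows with unseen products could be separated out with a join (the variable names follow the tutorial; the handling strategy itself is just one option):

# Keep only test rows whose Product_ID also appears in the training file
seen_products = train.select('Product_ID').distinct()
test_seen = test.join(seen_products, on='Product_ID', how='inner')

# Or, conversely, isolate the rows with unseen products for separate handling
test_unseen = test.join(seen_products, on='Product_ID', how='left_anti')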
- Calculating the pairwise frequency of categorical columns
Let us calculate the pairwise frequency of two columns in the DataFrame using the crosstab operation. Here we apply it to the "Age" and "Gender" columns of the train DataFrame.
train.crosstab('Age', 'Gender').show()
Output:
+----------+-----+------+
|Age_Gender|    F|     M|
+----------+-----+------+
|      0-17| 5083| 10019|
|     46-50|13199| 32502|
|     18-25|24628| 75032|
|     36-45|27170| 82843|
|       55+| 5083| 16421|
|     51-55| 9894| 28607|
|     26-35|50752|168835|
+----------+-----+------+
As the result above shows, the distinct values of Gender become the column names and the distinct values of Age become the row labels. If a pair never occurs, its count in the table is zero.
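The same table could also be built with groupBy and pivot (a sketch using standard DataFrame operations; only the column naming differs from crosstab):

# Equivalent pairwise frequency table via groupBy/pivot
train.groupBy('Age').pivot('Gender').count().show()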
How to get a DataFrame with unique rows?
To keep only unique rows and exclude duplicates, we use the dropDuplicates operation. It returns the DataFrame with all duplicate rows removed. Here is how dropDuplicates is applied to get all the unique combinations of the selected columns.
train.select('Age','Gender').dropDuplicates().show()
Output:
+-----+------+
|  Age|Gender|
+-----+------+
|51-55|     F|
|51-55|     M|
|26-35|     F|
|26-35|     M|
|36-45|     F|
|36-45|     M|
|46-50|     F|
|46-50|     M|
|  55+|     F|
|  55+|     M|
|18-25|     F|
| 0-17|     F|
|18-25|     M|
| 0-17|     M|
+-----+------+
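dropDuplicates can also take a list of column names, which deduplicates the full DataFrame based on just those columns (a small sketch; note this keeps one arbitrary full row per Age/Gender pair rather than only the two columns):

# Deduplicate the whole DataFrame on a subset of columns,
# keeping one (arbitrary) full row per Age/Gender combination
train.dropDuplicates(['Age', 'Gender']).show()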
- How to drop rows with null values?
To drop all the rows that contain a null value, we use the dropna operation. It accepts three parameters that control which rows are dropped.
- Subset – the list of column names to consider when looking for null values.
- Thresh – drop rows that have fewer than thresh non-null values. By default, this is not specified (None).
- How – takes one of two values, 'any' or 'all'. With 'any', a row is dropped if any of its values is null; with 'all', a row is dropped only if all of its values are null.
Here we drop the rows containing nulls using the default options: how='any', thresh=None, and subset=None.
train.dropna().count()
Output:
166821
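For completeness, here is a hedged sketch of the other options from the list above (the threshold value 10 and the choice of columns are just illustrative):

# Drop a row only if every value in it is null
train.dropna(how='all').count()

# Drop rows with fewer than 10 non-null values (10 is an arbitrary example)
train.dropna(thresh=10).count()

# Consider only the sparse category columns when looking for nulls
train.dropna(subset=['Product_Category_2', 'Product_Category_3']).count()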
- How to fill the DataFrame's null values with a constant number?
To fill the null values with a constant, we use the fillna operation. It takes two parameters.
- subset: the optional list of columns to consider when filling values.
- value: the replacement value, which can be an int, float, or string; nulls are replaced in all columns whose type matches the value's type.
Here we are going to fill -1 in place of the null values in the train DataFrame.
train.fillna(-1).show(2)
Output:
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|                -1|                -1|    8370|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
only showing top 2 rows
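fillna can also be restricted to particular columns, either with the subset parameter or by passing a dict of per-column values (a brief sketch):

# Fill nulls only in the two sparse category columns
train.fillna(-1, subset=['Product_Category_2', 'Product_Category_3']).show(2)

# Or give each column its own fill value
train.fillna({'Product_Category_2': -1, 'Product_Category_3': -1}).show(2)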
Conclusion
PySpark is gaining momentum in the world of Artificial Intelligence and Machine Learning, and it is used to solve real-world machine learning problems. You can create an RDD from different data sources, both external and existing, and run all types of transformations on it. We hope this article has been informative and has given you deep insights into PySpark DataFrames.
If you are curious to learn about PySpark and other data science tools, check out IIIT-B & upGrad's PG Diploma in Data Science, which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 sessions with industry mentors, 400+ hours of learning, and job assistance with top firms.