
Dataframe in Apache PySpark: Comprehensive Tutorial [with Examples]

By Rohit Sharma

Updated on Nov 30, 2022 | 12 min read | 6.4k views


Today, we are going to learn about the DataFrame in Apache PySpark. PySpark is one of the top data science tools. In Apache Spark, a DataFrame is a distributed collection of rows organized into named columns. It is conceptually similar to a table in an Excel sheet or a relational database. A PySpark DataFrame also shares several characteristics with an RDD:

Distributed: Both DataFrames and RDDs are distributed in nature.

Lazy Evaluation: A task is not executed until an action is performed.

Immutable: Like an RDD, a DataFrame cannot be changed once it is created. Instead, transformations can be applied to produce a new RDD / DataFrame.


Advantages of DataFrames

1. DataFrames offer APIs in several languages, such as Java, Scala, R, and Python, which makes it easy to work with them from many programming environments.

2. A wide range of data sources and formats are supported by DataFrames, so data from different sources can be used conveniently.

3. DataFrames can handle even petabytes of data, which is a remarkable ability for such massive datasets.

4. Apache Spark quickly understands the schema of a DataFrame because its data is organized under named columns. This knowledge allows Spark to optimize the query execution plan.

5. DataFrames are designed to process massive volumes of structured and semi-structured data quickly.

Apache Spark Setup

Apache Spark must be set up on the machine before it can be used for DataFrame operations. A DataFrame supports many operations on its data; here we are going to discuss some common ones.

The creation of a SparkContext is the first step in Apache Spark programming. A SparkContext is required to execute operations in a cluster: it tells Spark how to access a cluster and where to obtain one.

Read: Deep Learning Frameworks

Then the connection to the Apache Spark cluster is established. If one is using the Spark shell, the SparkContext is already created. Otherwise, configuration settings can be imported, provided, and initialized to create the SparkContext.

One can use this code for the creation:

from pyspark import SparkContext

sc = SparkContext()

DataFrame Creation from CSV file

A package has to be specified when launching the PySpark shell so that CSV files can be read. To do this, first download the latest version of the spark-csv package and extract it in Spark's home directory. After that, open the PySpark shell with the package included:

$ ./bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0

Now the DataFrame will be created after the data has been read from the CSV file.

train = sqlContext.load(source="com.databricks.spark.csv", path='PATH/train.csv', header=True, inferSchema=True)

test = sqlContext.load(source="com.databricks.spark.csv", path='PATH/test-comb.csv', header=True, inferSchema=True)

The train and test CSV files are located in the folder called PATH. header=True tells Spark that the first line of the CSV file is a header. With inferSchema=True, the SQL context automatically detects the data type of each column in the DataFrame; if we do not set inferSchema to True, all columns are read as strings.

Read: Python Libraries for Machine Learning

Manipulation of DataFrame 

Now here we are going to see how to manipulate the Data Frame:

  • Knowing the data type of the columns

printSchema is used to see each column and its data type. Applying printSchema() prints the schema in tree format.

train.printSchema()

Output:

root
 |-- User_ID: integer (nullable = true)
 |-- Product_ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Occupation: integer (nullable = true)
 |-- City_Category: string (nullable = true)
 |-- Stay_In_Current_City_Years: string (nullable = true)
 |-- Marital_Status: integer (nullable = true)
 |-- Product_Category_1: integer (nullable = true)
 |-- Product_Category_2: integer (nullable = true)
 |-- Product_Category_3: integer (nullable = true)
 |-- Purchase: integer (nullable = true)

After reading the CSV file, we can see that the data type (the schema) of each column in the DataFrame was detected accurately.

  • Showing the first n observations

To see the first n observations, one can use the head operation. PySpark's head operation works like pandas's head operation.

train.head(5)

Output:

[Row(User_ID=1000001, Product_ID=u'P00069042', Gender=u'F', Age=u'0-17', Occupation=10, City_Category=u'A', Stay_In_Current_City_Years=u'2', Marital_Status=0, Product_Category_1=3, Product_Category_2=None, Product_Category_3=None, Purchase=8370),
 Row(User_ID=1000001, Product_ID=u'P00248942', Gender=u'F', Age=u'0-17', Occupation=10, City_Category=u'A', Stay_In_Current_City_Years=u'2', Marital_Status=0, Product_Category_1=1, Product_Category_2=6, Product_Category_3=14, Purchase=15200),
 Row(User_ID=1000001, Product_ID=u'P00087842', Gender=u'F', Age=u'0-17', Occupation=10, City_Category=u'A', Stay_In_Current_City_Years=u'2', Marital_Status=0, Product_Category_1=12, Product_Category_2=None, Product_Category_3=None, Purchase=1422),
 Row(User_ID=1000001, Product_ID=u'P00085442', Gender=u'F', Age=u'0-17', Occupation=10, City_Category=u'A', Stay_In_Current_City_Years=u'2', Marital_Status=0, Product_Category_1=12, Product_Category_2=14, Product_Category_3=None, Purchase=1057),
 Row(User_ID=1000002, Product_ID=u'P00285442', Gender=u'M', Age=u'55+', Occupation=16, City_Category=u'C', Stay_In_Current_City_Years=u'4+', Marital_Status=0, Product_Category_1=8, Product_Category_2=None, Product_Category_3=None, Purchase=7969)]

The result comes back as a list of Row objects, so we will use the show operation to see it in a more readable format. Long values in the result can be truncated by passing the argument truncate=True.

train.show(2, truncate=True)

Output:

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+

only showing top 2 rows


  • Counting the number of rows in the DataFrame

To count the number of rows in a DataFrame, we can use the count operation. Now we will count the rows of the test and train files by applying count.

train.count(),test.count()

Output:

(233598, 550069)

We have 233598 and 550069 rows in the test and train files, respectively.

  • Getting the column count and column names from the test and train files

Similar to the columns attribute in a pandas DataFrame, we will use the columns operation to get the column names. First, we will print the number of columns and the column names from the train file.

len(train.columns), train.columns


Output:

12 ['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Purchase']

Now we do the same for the test file.

len(test.columns), test.columns

Output:

13 ['', 'User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Comb']

From the above output, we can see that there are 12 columns in the train file and 13 in the test file. The "Comb" column appears only in the test file, while the "Purchase" column is not present there. There is also one extra column in the test file that has no name.

  • Getting summary statistics such as count, max, min, standard deviation, and mean for the DataFrame's numerical columns

We will use the describe operation on the DataFrame. With describe, we can calculate a statistical summary of the numerical columns. All numerical columns in the DataFrame are included when no column name is specified.

train.describe().show()

Output:

+-------+------------------+-----------------+-------------------+------------------+------------------+------------------+------------------+
|summary|           User_ID|       Occupation|     Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|          Purchase|
+-------+------------------+-----------------+-------------------+------------------+------------------+------------------+------------------+
|  count|            550068|           550068|             550068|            550068|            376430|            166821|            550068|
|   mean|1003028.8424013031|8.076706879876669|0.40965298835780306| 5.404270017525106| 9.842329251122386|12.668243206790512| 9263.968712959126|
| stddev|1727.5915855308265|6.522660487341778| 0.4917701263173273|3.9362113692014082| 5.086589648693526| 4.125337631575267|5023.0653938206015|
|    min|           1000001|                0|                  0|                 1|                 2|                 3|                12|
|    max|           1006040|               20|                  1|                20|                18|                18|             23961|
+-------+------------------+-----------------+-------------------+------------------+------------------+------------------+------------------+

When a string or categorical column name is specified, describe gives the following output.

train.describe('Product_ID').show()

Output:

+-------+----------+
|summary|Product_ID|
+-------+----------+
|  count|    550068|
|   mean|      null|
| stddev|      null|
|    min| P00000142|
|    max|  P0099942|
+-------+----------+

The max and min values are calculated based on ASCII ordering; this is how the describe operation works on string-type columns.

  • Selecting the DataFrame's columns

We will pass the column names, separated by commas, to the select operation to select columns. Now let us see how the "User_ID" and "Age" columns are selected from the train file.

train.select('User_ID','Age').show(5)

Output:

+-------+----+
|User_ID| Age|
+-------+----+
|1000001|0-17|
|1000001|0-17|
|1000001|0-17|
|1000001|0-17|
|1000002| 55+|
+-------+----+
  • Finding the number of distinct products in the test and train files

To calculate the number of distinct rows in a DataFrame, we use the distinct operation. Here we apply it to count the number of distinct products in the test and train files.

train.select('Product_ID').distinct().count(), test.select('Product_ID').distinct().count()

Output:

(3633, 3492)

We have 3633 and 3492 distinct products in the train and test files, respectively, so the train file has more distinct values than the test file. Now we will use the subtract operation to find the Product_ID categories that are present in the test file but not in the train file. The same can be done for all categorical features.

diff_cat_in_train_test = test.select('Product_ID').subtract(train.select('Product_ID'))

diff_cat_in_train_test.distinct().count()  # For distinct count

Output:

46

So from the above result, we know that there are 46 categories that are present in the test file but not in the train file. Data for these categories will have to be skipped or handled separately, since it does not appear in the train file.

  • Calculating the pairwise frequency of categorical columns

Let us calculate the pairwise frequency of categorical columns in the DataFrame by using the crosstab operation. Now let us apply crosstab to the "Age" and "Gender" columns of the train DataFrame.

train.crosstab('Age', 'Gender').show()

Output:

+----------+-----+------+
|Age_Gender|    F|     M|
+----------+-----+------+
|      0-17| 5083| 10019|
|     46-50|13199| 32502|
|     18-25|24628| 75032|
|     36-45|27170| 82843|
|       55+| 5083| 16421|
|     51-55| 9894| 28607|
|     26-35|50752|168835|
+----------+-----+------+

In the result above, the distinct values of Gender become the column names, and the distinct values of Age become the row names. If a pair never occurs, its count in the table is zero.

Our learners also read: Free Python Course with Certification

How to get a DataFrame with unique rows?

To find unique rows and exclude duplicates, we use the dropDuplicates operation. It returns a new DataFrame with the duplicate rows dropped. Here is how dropDuplicates is applied to get all the unique rows for the selected columns.

train.select('Age','Gender').dropDuplicates().show()

Output:

+-----+------+
|  Age|Gender|
+-----+------+
|51-55|     F|
|51-55|     M|
|26-35|     F|
|26-35|     M|
|36-45|     F|
|36-45|     M|
|46-50|     F|
|46-50|     M|
|  55+|     F|
|  55+|     M|
|18-25|     F|
| 0-17|     F|
|18-25|     M|
| 0-17|     M|
+-----+------+

  • How to drop rows with null values?

If one wants to drop all the rows that have a null value, the dropna operation can be used. It considers three options when dropping rows from the DataFrame:

  • subset – the list of column names to consider when dropping rows with null values.
  • thresh – drops rows that have fewer than thresh non-null values. By default, nothing is specified.
  • how – takes one of two values: 'any' drops a row if any of its values is null, while 'all' drops a row only if all of its values are null.

Now let us drop rows containing nulls using the default options (how='any', thresh=None, subset=None).

train.dropna().count()

Output:

166821

  • How to fill the DataFrame's null values with a constant number?

To fill the null values with a constant number, we use the fillna operation. fillna considers two parameters:

  • subset: here, one needs to specify the columns to be considered for filling values.
  • value: the value to replace nulls with; it can be of any data type, such as string, float, or int, and is applied across all the columns.

Here we are going to fill -1 in place of the null values in the train DataFrame.

train.fillna(-1).show(2)

Output:

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|                -1|                -1|    8370|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+

only showing top 2 rows

Conclusion

PySpark is gaining momentum in the world of Artificial Intelligence and Machine Learning. It is used to solve real-world machine learning problems. You can create RDDs from different data sources, both external and existing, and apply all types of transformations on them. We hope this article has been informative and has given you deep insights into PySpark DataFrames.

If you are curious to learn about PySpark, and other data science tools, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.

Frequently Asked Questions (FAQs)

1. Is PySpark more efficient than Pandas?

2. What are some of the drawbacks of using Apache PySpark?

3. How are Datasets, DataFrame and RDD different from each other?
