PySpark and Pandas DataFrames: Side-by-Side Syntax Comparisons — Part 1
Background
The use of distributed computing is nearly inevitable when data sizes are large (for example, more than 10M rows in an ETL or ML modeling workload). If you have access to a Spark environment through technologies such as JupyterHub or Databricks, then PySpark can be a good option for working with large datasets. The challenge is that many data scientists are comfortable with Pandas but may not yet be familiar with PySpark dataframes (as of ~2019). Moving to PySpark thus becomes a learning journey, often involving converting Pandas code to PySpark.
There are many differences between PySpark and Pandas, and syntax is only one of them. Initially, a data scientist may "think" in Pandas and then translate that thought into PySpark (before learning to think directly in PySpark). To help with this journey, this article presents a concise side-by-side syntax comparison of commonly used statements.
There are four sections below.
- Dataframe creation
- Data Exploration
- Joins, Append, Union
- Where statements
A link to the publicly shared Databricks notebook is here (valid until May 2020).
Dataframe Creation
We will create a Pandas and a PySpark dataframe in this section and use those dataframes later in the rest of the sections.
# Pandas
import pandas as pd
df_pandas = pd.DataFrame({'a': [1,2], 'b': [3,4], 'c': [5,6]})

# PySpark - will give 1 row with an array per column, not 2 rows
df_spark = spark.createDataFrame([{"a": [1,2], "b": [3,4], "c": [5,6]}])

# PySpark - intended 2 rows per column
df_spark = sc.parallelize([[1,3,5], [2,4,6]]).toDF(("a", "b", "c"))

# Output:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
+---+---+---+
You may get this warning: “inferring schema from dict is deprecated”. If schema or datatype is important for your use case, more information can be found here.
Data Exploration
Number of Rows
# Pandas
len(df_pandas)

# PySpark
df_spark.count()
Drop Duplicates
# Pandas or PySpark
df.drop_duplicates()

# PySpark
df.dropDuplicates()
Describe DataFrame
# Pandas
df_pandas.describe()

# PySpark
df_spark.describe().show()
Group by: sum, count, etc.
# Pandas
df.a.sum()
# output = 3

# PySpark
df_spark.select("a").groupBy().sum().collect()[0][0]
# output = 3
Note the column names in PySpark output after Group By (sum(a), etc.)
# Pandas
df_pandas.groupby(['a']).sum()

# PySpark
df_spark.groupby("a").sum().show()

# PySpark output:
+---+------+------+------+
|  a|sum(a)|sum(b)|sum(c)|
+---+------+------+------+
|  1|     1|     3|     5|
|  2|     2|     4|     6|
+---+------+------+------+
Joins, Append and Union
Append = Union in PySpark with a catch
Appending dataframes works differently in Pandas and PySpark. In PySpark, the order of columns matters: `union` matches columns by position, not by name. Let's create a dataframe with a different order of columns.
# Note different order
# Note the different column order
df_spark2 = sc.parallelize([[11,13,15], [12,14,16]]).toDF(("b", "a", "c"))

# Append the PySpark DataFrame with a different order of columns
display(df_spark.union(df_spark2))

# Output:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
| 11| 13| 15|
| 12| 14| 16|
+---+---+---+

# Intended output, which maintains column order. Plain union will not
# automatically align the columns by name.
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
| 13| 11| 15|
| 14| 12| 16|
+---+---+---+
Join
PySpark Join syntax is similar to Pandas.
df_spark3 = sc.parallelize([[1,13,15], [2,14,16]]).toDF(("a", "d", "e"))
df_spark4 = df_spark.join(df_spark3, how='left', on=['a'])

# Output:
+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
|  1|  3|  5| 13| 15|
|  2|  4|  6| 14| 16|
+---+---+---+---+---+
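For reference, the Pandas counterpart of this left join is `merge`. A small self-contained sketch; the `df_pandas3`/`df_pandas4` names simply mirror the PySpark variables above:

```python
import pandas as pd

df_pandas = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df_pandas3 = pd.DataFrame({"a": [1, 2], "d": [13, 14], "e": [15, 16]})

# Pandas equivalent of df_spark.join(df_spark3, how='left', on=['a'])
df_pandas4 = df_pandas.merge(df_pandas3, how="left", on="a")
```

Both keyword names line up nicely (`how`, `on`), which makes this one of the easier translations between the two APIs.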
Where Statements
# Pandas
df_pandas[df_pandas.a == 1]

# PySpark
df_spark.filter("a == 1")
df_spark.where("a == 1")
In the next part of this series, we will discuss more aspects that may be useful when switching from Pandas to PySpark. Please leave your comments and suggestions, which I can address in subsequent posts. Thanks.