PySpark and Pandas DataFrames: Side-by-Side Syntax Comparisons — Part 1
Background
The use of distributed computing is nearly inevitable when data sizes are large (for example, more than 10M rows in an ETL or ML modeling workload). If you have access to a Spark environment through technologies such as JupyterHub or Databricks, then PySpark can be a good option for working with large datasets. The challenge is that many data scientists are comfortable with Pandas but may not yet be familiar with PySpark dataframes (as of ~2019). Moving to PySpark thus becomes a learning journey, often involving converting Pandas code to PySpark.
There are many differences between PySpark and Pandas, and syntax is only one of them. Initially, a data scientist may "think" in Pandas and then translate that thought into PySpark (before learning to think directly in PySpark). To help with this journey, this article presents a concise side-by-side syntax comparison of commonly used statements.
There are four sections below.
- Dataframe creation
- Data Exploration
- Joins, Append, Union
- Where statements
A link to the publicly shared Databricks notebook is here (valid until May 2020).
Dataframe Creation
We will create a Pandas and a PySpark dataframe in this section and use those dataframes later in the rest of the sections.
# Pandas
import pandas as pd
df_pandas = pd.DataFrame({'a': [1,2], 'b': [3,4], 'c': [5,6]})

# PySpark - will give 1 row with an array per column, not 2 rows
df_spark = spark.createDataFrame([{"a": [1,2], "b": [3,4], "c": [5,6]}])

# PySpark - intended 2 rows per column
df_spark = sc.parallelize([[1,3,5], [2,4,6]]).toDF(("a", "b", "c"))

# Output:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
+---+---+---+
You may get this warning: “inferring schema from dict is deprecated”. If schema or datatype is important for your use case, more information can be found here.
Data Exploration
Number of Rows
# Pandas
len(df_pandas)

# PySpark
df_spark.count()
Drop Duplicates
# Pandas or PySpark
df.drop_duplicates()

# PySpark
df.dropDuplicates()
Describe DataFrame
# Pandas
df_pandas.describe()

# PySpark
df_spark.describe().show()
Group by: sum, count, etc.
# Pandas
df.a.sum()
# output = 3

# PySpark
df_spark.select("a").groupBy().sum().collect()[0][0]
# output = 3
Note the column names in PySpark output after Group By (sum(a), etc.)
# Pandas
df_pandas.groupby(['a']).sum()

# PySpark
df_spark.groupby("a").sum().show()

# PySpark output:
+---+------+------+------+
|  a|sum(a)|sum(b)|sum(c)|
+---+------+------+------+
|  1|     1|     3|     5|
|  2|     2|     4|     6|
+---+------+------+------+
Joins, Append and Union
Append = Union in PySpark with a catch
Appending dataframes works differently in Pandas and PySpark. In PySpark, the order of columns matters: `union` matches columns by position, not by name. Let's create a dataframe with a different order of columns.
# Note different order
# Note the different column order
df_spark2 = sc.parallelize([[11,13,15], [12,14,16]]).toDF(("b", "a", "c"))

# Append the PySpark DataFrame with a different order of columns
display(df_spark.union(df_spark2))

# Output:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
| 11| 13| 15|
| 12| 14| 16|
+---+---+---+

# Intended output, which maintains column order. Plain union will not
# automatically align the columns by name.
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
| 13| 11| 15|
| 14| 12| 16|
+---+---+---+
Join
PySpark Join syntax is similar to Pandas.
df_spark3 = sc.parallelize([[1,13,15], [2,14,16]]).toDF(("a", "d", "e"))
df_spark4 = df_spark.join(df_spark3, how='left', on=['a'])

# Output:
+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
|  1|  3|  5| 13| 15|
|  2|  4|  6| 14| 16|
+---+---+---+---+---+
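For reference, the Pandas counterpart of this left join is `merge`. A small self-contained sketch; the `df_pandas3`/`df_pandas4` names simply mirror the PySpark variables above:

```python
import pandas as pd

df_pandas = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df_pandas3 = pd.DataFrame({"a": [1, 2], "d": [13, 14], "e": [15, 16]})

# Pandas equivalent of df_spark.join(df_spark3, how='left', on=['a'])
df_pandas4 = df_pandas.merge(df_pandas3, how="left", on="a")
```

Both keyword names line up nicely (`how`, `on`), which makes this one of the easier translations between the two APIs.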
Where Statements
# Pandas
df_pandas[df_pandas.a == 1]

# PySpark
df_spark.filter("a == 1")
df_spark.where("a == 1")
In the next part of this series, we will discuss more aspects that may be useful when switching from Pandas to PySpark. Please leave your comments and suggestions, which I can address in subsequent posts. Thanks.