PySpark and Pandas DataFrames: Side-by-Side Syntax Comparisons — Part 1

Anshuman Lall
Published in Predmatic · 3 min read · Dec 24, 2019


Background

The use of distributed computing is nearly inevitable when data sizes are large (for example, more than 10M rows in an ETL or ML modeling workload). If you have access to a Spark environment through technologies such as JupyterHub or Databricks, then PySpark can be a good option for working with large datasets. The challenge, however, is that many data scientists (as of 2019) are comfortable with Pandas but not fully familiar with PySpark dataframes. Learning PySpark thus becomes a journey for them, often involving conversion of Pandas code to PySpark.

There are many differences between PySpark and Pandas, and syntax is only one of them. Initially, a data scientist may “think” in Pandas and then translate that thought to PySpark (before eventually thinking directly in PySpark). To help with this journey, this article presents a concise side-by-side comparison of commonly used statements.

There are 4 sections below.

  • Dataframe creation
  • Data Exploration
  • Joins, Append, Union
  • Where statements

A link to the publicly shared Databricks notebook is here (valid until May 2020).


Dataframe Creation

We will create a Pandas and a PySpark dataframe in this section and reuse them in the remaining sections.

# Pandas
import pandas as pd
df_pandas = pd.DataFrame({'a': [1,2], 'b': [3,4], 'c':[5,6] })
# PySpark - Will give 1 row with an array, not 2 rows
df_spark = spark.createDataFrame([{"a": [1,2], "b": [3,4], "c": [5,6]}])
# PySpark - Intended 2 rows per column
df_spark = sc.parallelize([[1,3,5], [2,4,6]]).toDF(("a", "b", "c"))
# Output:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
+---+---+---+

You may get this warning: “inferring schema from dict is deprecated”. If schema or datatype is important for your use case, more information can be found here.
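For example, here is a minimal sketch of passing an explicit schema instead of relying on inference (it assumes a SparkSession named spark is available, as on Databricks; the variable names are illustrative):

# Define the schema explicitly rather than letting Spark infer it from a dict
from pyspark.sql.types import StructType, StructField, LongType
schema = StructType([
    StructField("a", LongType(), True),
    StructField("b", LongType(), True),
    StructField("c", LongType(), True),
])
df_spark_explicit = spark.createDataFrame([[1, 3, 5], [2, 4, 6]], schema=schema)
df_spark_explicit.printSchema()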

Data Exploration

Number of Rows

# Pandas
len(df_pandas)
# PySpark
df_spark.count()

Drop Duplicates

# Pandas or PySpark
df.drop_duplicates()
# PySpark
df.dropDuplicates()
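Both APIs also accept a subset of columns to deduplicate on. A minimal sketch, using column "a" from the dataframes above purely for illustration:

# Pandas - keep the first row for each distinct value of column "a"
df_pandas.drop_duplicates(subset=['a'])
# PySpark - same idea; dropDuplicates takes a list of column names
df_spark.dropDuplicates(['a'])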

Describe DataFrame

# Pandas
df_pandas.describe()
# PySpark
df_spark.describe().show()

Group by: sum, count, etc.

First, sum a single column (no grouping):

# Pandas
df_pandas.a.sum()
# output = 3
# PySpark
df_spark.select("a").groupBy().sum().collect()[0][0]
# output = 3
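An equivalent and arguably more idiomatic PySpark form uses an aggregate function from pyspark.sql.functions; a sketch that should give the same result:

# PySpark - aggregate the whole dataframe with F.sum, then pull out the scalar
from pyspark.sql import functions as F
df_spark.agg(F.sum("a")).collect()[0][0]
# output = 3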

Note the column names in PySpark output after Group By (sum(a), etc.)

# Pandas
df_pandas.groupby(['a']).sum()
# PySpark
df_spark.groupby("a").sum().show()
# PySpark output:
+---+------+------+------+
|  a|sum(a)|sum(b)|sum(c)|
+---+------+------+------+
|  1|     1|     3|     5|
|  2|     2|     4|     6|
+---+------+------+------+
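If the sum(a)-style names are inconvenient, one option is to spell out the aggregates and name them with alias. A minimal sketch (the output column names b_sum and c_sum are just examples):

# PySpark - explicit aggregates with friendlier column names
from pyspark.sql import functions as F
df_spark.groupby("a").agg(F.sum("b").alias("b_sum"), F.sum("c").alias("c_sum")).show()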

Joins, Append and Union

Append = Union in PySpark with a catch

Appending dataframes works differently in Pandas and PySpark: PySpark’s union appends rows by position, so the order of columns matters.

Let’s create a dataframe with a different column order:

# Note different order
df_spark2 = sc.parallelize([[11,13,15], [12,14,16]]).toDF(("b", "a", "c"))
# Append the PySpark DataFrame with different order of columns
display(df_spark.union(df_spark2))
# Output:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
| 11| 13| 15|
| 12| 14| 16|
+---+---+---+
# Intended output, which maintains column alignment. union does not
# automatically align the columns by name.
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  3|  5|
|  2|  4|  6|
| 13| 11| 15|
| 14| 12| 16|
+---+---+---+
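One way to get the intended result is unionByName (available since Spark 2.3), which matches columns by name rather than by position. A minimal sketch using the two dataframes above:

# PySpark - align columns by name before appending rows
display(df_spark.unionByName(df_spark2))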

Join

PySpark Join syntax is similar to Pandas.

df_spark3 = sc.parallelize([[1,13,15], [2,14,16]]).toDF(("a", "d", "e"))
df_spark4 = df_spark.join(df_spark3, how='left', on=['a'])
display(df_spark4)
# Output:
+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
|  1|  3|  5| 13| 15|
|  2|  4|  6| 14| 16|
+---+---+---+---+---+
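For comparison, a rough Pandas equivalent of the same left join (df_pandas3 is a hypothetical Pandas version of df_spark3, created here only for illustration):

# Pandas - left join on column "a"
df_pandas3 = pd.DataFrame({'a': [1, 2], 'd': [13, 14], 'e': [15, 16]})
df_pandas4 = df_pandas.merge(df_pandas3, how='left', on='a')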

Where Statements

# Pandas
df_pandas[df_pandas.a == 1]
# PySpark
df_spark.filter("a == 1")
df_spark.where("a == 1")
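PySpark’s filter and where also accept column expressions instead of SQL strings; a minimal sketch:

# PySpark - same filter written with column expressions
from pyspark.sql.functions import col
df_spark.filter(df_spark.a == 1)
df_spark.where(col("a") == 1)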

In the next part of this article series, we will discuss more aspects that may be useful when switching from Pandas to PySpark. Please leave your comments and suggestions, which I can address in subsequent posts. Thanks.
