For big data computing, Apache Spark has two well-known APIs: RDD and DataFrame.

The former is the original API, which has been around since Spark was born. The latter is built on Spark's SQL engine and is relatively new.

For structured data, we usually prefer the DataFrame API, because it is clearer to read and its queries go through Spark SQL's optimizer, which generally makes them faster.

Here I give a comparison of the two APIs on the same job.

I have a dataset "person.csv" which contains 10 million person records.

The data columns are name, sex, born, zip, email, job, and salary.
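Before running anything heavy, it's worth peeking at a few raw lines to confirm the layout (note that both implementations below assume the file has no header row):

scala> sc.textFile("tmp/person.csv").take(3).foreach(println)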

I want to group the data by job, calculate the average salary for each job, and print the top 20 results.

This is the implementation with the RDD API:

scala> val rdd = sc.textFile("tmp/person.csv")

scala> rdd.map(_.split(",")).
     | map(x => (x(5), x(6).toDouble)).
     | groupByKey.mapValues(v => v.sum / v.size).
     | sortBy(-_._2).take(20).
     | foreach(println)

(Substance Abuse Counselor,17546.287735849055)
(Veterinarian,17545.0951026888)
(Executive Assistant,17543.571044904747)
(Art Director,17542.721729023466)
(Hairdresser,17539.738195188907)
(Restaurant Cook,17535.075006769563)
(Security Guard,17533.578946842325)
(Mental Health Counselor,17532.784909494825)
(Registered Nurse,17531.777721322254)
(Veterinary Technologist & Technician,17528.951512963966)
(Construction Worker,17528.794206354774)
(High School Teacher,17525.977890446367)
(Financial Analyst,17525.549314807955)
(Dentist,17525.52422232021)
(Physician,17525.395588883104)
(Cashier,17524.30100802222)
(Medical Secretary,17523.694613958614)
(Massage Therapist,17522.79692116623)
(Customer Service Representative,17520.169818680053)
(Preschool Teacher,17519.08829448097)

It takes 13.5s to finish this job.
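A side note on the RDD version: groupByKey collects every salary value for a key before averaging, which shuffles a lot of data. Here is a sketch of the same computation with reduceByKey (my variant, not part of the timing above), which pre-aggregates sums and counts per partition before the shuffle:

scala> rdd.map(_.split(",")).
     | map(x => (x(5), (x(6).toDouble, 1L))).
     | reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }.
     | mapValues { case (sum, cnt) => sum / cnt }.
     | sortBy(-_._2).take(20).
     | foreach(println)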

And this is the implementation with the DataFrame API:

scala> val schema = "name STRING, sex STRING, born STRING, zip INT, email STRING, job STRING, salary DOUBLE"
schema: String = name STRING, sex STRING, born STRING, zip INT, email STRING, job STRING, salary DOUBLE

scala> val df = spark.read.format("csv").schema(schema).load("tmp/person.csv")
df: org.apache.spark.sql.DataFrame = [name: string, sex: string ... 5 more fields]

scala> df.groupBy("job").agg(avg("salary").alias("avg_salary")).orderBy(desc("avg_salary")).show(false)
+------------------------------------+------------------+
|job                                 |avg_salary        |
+------------------------------------+------------------+
|Substance Abuse Counselor           |17546.287735849055|
|Veterinarian                        |17545.0951026888  |
|Executive Assistant                 |17543.571044904747|
|Art Director                        |17542.721729023466|
|Hairdresser                         |17539.738195188907|
|Restaurant Cook                     |17535.075006769563|
|Security Guard                      |17533.578946842325|
|Mental Health Counselor             |17532.784909494825|
|Registered Nurse                    |17531.777721322254|
|Veterinary Technologist & Technician|17528.951512963966|
|Construction Worker                 |17528.794206354774|
|High School Teacher                 |17525.977890446367|
|Financial Analyst                   |17525.549314807955|
|Dentist                             |17525.52422232021 |
|Physician                           |17525.395588883104|
|Cashier                             |17524.30100802222 |
|Medical Secretary                   |17523.694613958614|
|Massage Therapist                   |17522.79692116623 |
|Customer Service Representative     |17520.169818680053|
|Preschool Teacher                   |17519.08829448097 |
+------------------------------------+------------------+
only showing top 20 rows

It takes 10s to finish the job.
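For completeness, the same aggregation can also be written in plain SQL against a temporary view (a sketch; the view name "person" is my choice):

scala> df.createOrReplaceTempView("person")

scala> spark.sql("""
     |   SELECT job, avg(salary) AS avg_salary
     |   FROM person
     |   GROUP BY job
     |   ORDER BY avg_salary DESC
     | """).show(false)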

As you can see, the DataFrame API is clearer to read and runs faster than the RDD API.
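The speed difference is no accident: DataFrame queries are planned by Spark SQL's Catalyst optimizer before execution, and you can inspect the generated plan yourself (using the df defined above):

scala> df.groupBy("job").agg(avg("salary").alias("avg_salary")).explain()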

But RDD still has its own uses, such as handling unstructured data and doing low-level data transformations.
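For example, counting word frequencies in a free-form text file is natural with the RDD API (a sketch; the path "tmp/app.log" is hypothetical):

scala> sc.textFile("tmp/app.log").
     | flatMap(_.split("\\s+")).
     | map(w => (w, 1)).
     | reduceByKey(_ + _).
     | sortBy(-_._2).take(10).
     | foreach(println)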
