Here I provide a dataset of NYC taxi trips from March 2016.

Here is the download link.

We can load it into Spark as follows:

 scala> val df = spark.read.format("csv").option("inferSchema",true).option("header",true).load("tmp/taxi.csv")
df: org.apache.spark.sql.DataFrame = [VendorID: int, tpep_pickup_datetime: string ... 17 more fields]

This data is well organized: it has a clear header row and no abnormal values in the columns.
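
One caveat: inferSchema still reads the two datetime columns as plain strings (you can see that in the load output above), and the inference itself costs an extra pass over the file. If you want real timestamps, something like the sketch below should work on a reasonably recent Spark (2.2 or later); the dfTyped name is just mine, and the snippet is meant to be run with :paste in the shell. I don't use dfTyped later, it only shows the idea.

    import org.apache.spark.sql.functions.{col, to_timestamp}

    // Cast the two datetime columns from string to timestamp so that
    // time arithmetic (e.g. computing trip duration) is possible later.
    val dfTyped = df
      .withColumn("tpep_pickup_datetime", to_timestamp(col("tpep_pickup_datetime")))
      .withColumn("tpep_dropoff_datetime", to_timestamp(col("tpep_dropoff_datetime")))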

Let's look at the schema and size of the data:

 scala> df.printSchema
root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: string (nullable = true)
 |-- tpep_dropoff_datetime: string (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- pickup_longitude: double (nullable = true)
 |-- pickup_latitude: double (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- dropoff_longitude: double (nullable = true)
 |-- dropoff_latitude: double (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)

scala> df.count
res1: Long = 12210952

scala> df.rdd.getNumPartitions
res2: Int = 15

As you can see, it has about 12 million rows, and Spark handles it fine even on my 4 GB RAM VPS.
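
Since we are going to query this DataFrame several times, it can also help to cache it so that later actions don't re-read and re-parse the whole CSV. On a small 4 GB box I would pick a disk-backed storage level; this is optional, just a minimal sketch (run with :paste in the shell):

    import org.apache.spark.storage.StorageLevel

    // Keep whatever fits in RAM and spill the rest to local disk,
    // which is friendlier to a small 4 GB VPS.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    // The cache fills lazily: the first action materializes it,
    // and queries after that reuse the cached data.
    df.count()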

The data is fun to look at. Here are the fare-related columns for the first 50 trips:

 scala> df.select("VendorID","trip_distance","fare_amount","tip_amount","mta_tax","extra","tolls_amount","improvement_surcharge","total_amount").show(50)
+--------+-------------+-----------+----------+-------+-----+------------+---------------------+------------+
|VendorID|trip_distance|fare_amount|tip_amount|mta_tax|extra|tolls_amount|improvement_surcharge|total_amount|
+--------+-------------+-----------+----------+-------+-----+------------+---------------------+------------+
|       1|          2.5|        9.0|      2.05|    0.5|  0.5|         0.0|                  0.3|       12.35|
|       1|          2.9|       11.0|      3.05|    0.5|  0.5|         0.0|                  0.3|       15.35|
|       2|        19.98|       54.5|       8.0|    0.5|  0.5|         0.0|                  0.3|        63.8|
|       2|        10.78|       31.5|      3.78|    0.5|  0.0|        5.54|                  0.3|       41.62|
|       2|        30.43|       98.0|       0.0|    0.0|  0.0|        15.5|                  0.3|       113.8|
|       2|         5.92|       23.5|      5.06|    0.5|  1.0|         0.0|                  0.3|       30.36|
|       2|         5.72|       23.0|       0.0|    0.5|  0.5|         0.0|                  0.3|        24.3|
|       1|          6.2|       20.5|       0.0|    0.5|  0.5|         0.0|                  0.3|        21.8|
|       1|          0.7|        5.5|       2.0|    0.5|  0.5|         0.0|                  0.3|         8.8|
|       2|         7.18|       23.5|       3.2|    0.5|  0.5|         0.0|                  0.3|        28.0|
|       2|         0.54|        4.0|       0.0|    0.5|  0.5|         0.0|                  0.3|         5.3|
|       1|          1.7|        8.0|       0.0|    0.5|  0.5|         0.0|                  0.3|         9.3|
|       1|          1.1|        5.5|       2.2|    0.5|  0.5|         0.0|                  0.3|         9.0|
|       2|          2.1|        9.0|      2.06|    0.5|  0.5|         0.0|                  0.3|       12.36|
|       2|         8.54|       27.0|      5.66|    0.5|  0.5|         0.0|                  0.3|       33.96|
|       2|          2.0|        8.5|       2.0|    0.5|  0.5|         0.0|                  0.3|        11.8|
|       1|          3.2|       11.0|       0.0|    0.5|  0.5|         0.0|                  0.3|        12.3|
|       2|         1.59|        8.0|      1.86|    0.5|  0.5|         0.0|                  0.3|       11.16|
|       2|        16.81|       52.0|       8.0|    0.5|  0.0|        5.54|                  0.3|       66.34|
|       1|          0.5|        4.5|      1.15|    0.5|  0.5|         0.0|                  0.3|        6.95|
|       1|          0.6|        5.5|       0.0|    0.5|  0.5|         0.0|                  0.3|         6.8|
|       1|          9.9|       28.5|       0.0|    0.5|  0.5|         0.0|                  0.3|        29.8|
|       2|          0.8|        4.5|      1.16|    0.5|  0.5|         0.0|                  0.3|        6.96|
|       2|         0.49|        4.0|       0.0|    0.5|  0.5|         0.0|                  0.3|         5.3|
|       2|         5.31|       17.0|      4.58|    0.5|  0.5|         0.0|                  0.3|       22.88|
|       1|          3.5|       13.0|       0.0|    0.5|  0.5|         0.0|                  0.3|        14.3|
|       2|         9.14|       31.0|      6.46|    0.5|  0.5|         0.0|                  0.3|       38.76|
|       2|         2.17|        8.5|      1.96|    0.5|  0.5|         0.0|                  0.3|       11.76|
|       2|        18.03|       52.0|      13.2|    0.5|  0.0|         0.0|                  0.3|        66.0|
|       1|          1.0|        5.0|       1.0|    0.5|  0.5|         0.0|                  0.3|         7.3|
|       2|        11.53|       33.5|       0.0|    0.5|  0.5|         0.0|                  0.3|        34.8|
|       2|         4.26|       16.0|       0.0|    0.5|  0.5|         0.0|                  0.3|        17.3|
|       2|         2.66|       11.0|      2.46|    0.5|  0.5|         0.0|                  0.3|       14.76|
|       1|          1.7|        9.0|      2.05|    0.5|  0.5|         0.0|                  0.3|       12.35|
|       1|          2.4|       10.5|      2.35|    0.5|  0.5|         0.0|                  0.3|       14.15|
|       1|          1.0|        5.5|       0.0|    0.5|  0.5|         0.0|                  0.3|         6.8|
|       1|          3.7|       13.5|      2.95|    0.5|  0.5|         0.0|                  0.3|       17.75|
|       1|          1.3|        7.5|       0.0|    0.5|  0.5|         0.0|                  0.3|         8.8|
|       1|          9.2|       31.0|      8.05|    0.5|  0.5|         0.0|                  0.3|       40.35|
|       2|         0.79|        5.5|      1.36|    0.5|  0.5|         0.0|                  0.3|        8.16|
|       2|         0.95|        4.5|       0.0|    0.5|  0.5|         0.0|                  0.3|         5.8|
|       1|          6.9|       22.5|       0.0|    0.5|  0.5|         0.0|                  0.3|        23.8|
|       2|         4.26|       15.5|       2.0|    0.5|  0.5|         0.0|                  0.3|        18.8|
|       2|         3.11|       12.0|       0.0|    0.5|  0.5|         0.0|                  0.3|        13.3|
|       1|          3.8|       13.0|      2.85|    0.5|  0.5|         0.0|                  0.3|       17.15|
|       1|          2.8|       11.0|       0.0|    0.5|  0.5|         0.0|                  0.3|        12.3|
|       2|         1.21|        5.5|       1.0|    0.5|  0.5|         0.0|                  0.3|         7.8|
|       2|         1.31|        6.0|      1.46|    0.5|  0.5|         0.0|                  0.3|        8.76|
|       1|          2.7|       11.0|      2.45|    0.5|  0.5|         0.0|                  0.3|       14.75|
|       2|         5.39|       20.5|      4.36|    0.5|  0.5|         0.0|                  0.3|       26.16|
+--------+-------------+-----------+----------+-------+-----+------------+---------------------+------------+
only showing top 50 rows
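
Beyond eyeballing rows, we can also ask Spark for quick summary statistics (count, mean, stddev, min, max) on a few of the numeric columns; a small sketch:

    // Summary statistics for some of the numeric columns.
    df.describe("trip_distance", "fare_amount", "tip_amount", "total_amount").show()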

And we can run a simple aggregation:

 scala> df.groupBy("VendorID").agg(avg("trip_distance").alias("avg_trip"), avg("total_amount").alias("avg_amount")).show
+--------+------------------+------------------+
|VendorID|          avg_trip|        avg_amount|
+--------+------------------+------------------+
|       1| 9.663680577438551|15.937637880208014|
|       2|3.0078281666926796| 16.14150544542647|
+--------+------------------+------------------+
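
The same pattern extends to other groupings. For example, here is a sketch of trip count and average tip per payment type (in the TLC data dictionary, payment_type 1 is credit card and 2 is cash; cash tips are usually not recorded, so their average tends to sit near zero). The column aliases trips and avg_tip are just my own names:

    import org.apache.spark.sql.functions.{avg, count}

    // Trip count and average tip for each payment type.
    df.groupBy("payment_type")
      .agg(count("*").alias("trips"), avg("tip_amount").alias("avg_tip"))
      .orderBy("payment_type")
      .show()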

Enjoy it!
