This is the dataset for IMDB works title. It has 8.5+ million items included.

Download the dataset from here

We then load it into Spark for a fast analysis.

 scala> val df = spark.read.format("csv").option("inferSchema",true).option("header",true).load("skydrive/imdb.csv")
val df: org.apache.spark.sql.DataFrame = [tconst: string, titleType: string ... 7 more fields]

Get its schema and size.

 scala> df.printSchema()
root
|-- tconst: string (nullable = true)
|-- titleType: string (nullable = true)
|-- primaryTitle: string (nullable = true)
|-- originalTitle: string (nullable = true)
|-- isAdult: string (nullable = true)
|-- startYear: string (nullable = true)
|-- endYear: string (nullable = true)
|-- runtimeMinutes: string (nullable = true)
|-- genres: string (nullable = true)

scala> df.count()
val res1: Long = 8549346

scala> df.rdd.getNumPartitions
val res2: Int = 6

Get the sample sets.

 scala> df.show(3)
+---------+---------+--------------------+--------------------+-------+---------+-------+--------------+--------------------+
| tconst|titleType| primaryTitle| originalTitle|isAdult|startYear|endYear|runtimeMinutes| genres|
+---------+---------+--------------------+--------------------+-------+---------+-------+--------------+--------------------+
|tt0000001| short| Carmencita| Carmencita| 0| 1894| \N| 1| Documentary,Short|
|tt0000002| short|Le clown et ses c...|Le clown et ses c...| 0| 1892| \N| 5| Animation,Short|
|tt0000003| short| Pauvre Pierrot| Pauvre Pierrot| 0| 1892| \N| 4|Animation,Comedy,...|
+---------+---------+--------------------+--------------------+-------+---------+-------+--------------+--------------------+
only showing top 3 rows

Now we can check which year had produced the most works by running this query.

 scala> df.filter($"startYear".rlike("^[0-9]+$")).groupBy("startYear").count().orderBy(desc("count")).show()
+---------+------+
|startYear| count|
+---------+------+
| 2018|408237|
| 2017|405951|
| 2019|389812|
| 2016|385868|
| 2015|364535|
| 2020|357613|
| 2021|348330|
| 2014|347493|
| 2013|327611|
| 2012|306525|
| 2011|269609|
| 2010|238265|
| 2009|209788|
| 2008|198017|
| 2007|183607|
| 2006|165042|
| 2005|148430|
| 2004|134854|
| 2003|116679|
| 2002|104397|
+---------+------+
only showing top 20 rows

As the query above, because the year has abnormal values included so we have to use a regex to filter out them.

The types seem not too many:

 scala> df.groupBy("titleType").count().orderBy(desc("count")).show()
+------------+-------+
| titleType| count|
+------------+-------+
| tvEpisode|6378471|
| short| 847948|
| movie| 597375|
| video| 255943|
| tvSeries| 218363|
| tvMovie| 134189|
|tvMiniSeries| 41262|
| tvSpecial| 35370|
| videoGame| 29999|
| tvShort| 10424|
| tvPilot| 2|
+------------+-------+

We can query how many adult works were made.

 scala> df.filter($"isAdult".isin("0","1")).groupBy("isAdult").count().show()
+-------+-------+
|isAdult| count|
+-------+-------+
| 0|8284336|
| 1| 263411|
+-------+-------+

Data is fun. Just try yourself.

Return to home | Generated on 09/29/22