statistics for IMDB works title

This is the dataset for IMDB works title. It has 8.5+ million items included.

We then load it into Spark for a fast analysis.

 scala> val df = spark.read.format("csv").option("inferSchema",true).option("header",true).load("skydrive/imdb.csv")
 val df: org.apache.spark.sql.DataFrame = [tconst: string, titleType: string ... 7 more fields]

Get its schema and size.

 scala> df.printSchema()
 root
  |-- tconst: string (nullable = true)
  |-- titleType: string (nullable = true)
  |-- primaryTitle: string (nullable = true)
  |-- originalTitle: string (nullable = true)
  |-- isAdult: string (nullable = true)
  |-- startYear: string (nullable = true)
  |-- endYear: string (nullable = true)
  |-- runtimeMinutes: string (nullable = true)
  |-- genres: string (nullable = true)
   
 scala> df.count()
 val res1: Long = 8549346                                                        
   
 scala> df.rdd.getNumPartitions
 val res2: Int = 6

Get the sample sets.

 scala> df.show(3)
 +---------+---------+--------------------+--------------------+-------+---------+-------+--------------+--------------------+
 |   tconst|titleType|        primaryTitle|       originalTitle|isAdult|startYear|endYear|runtimeMinutes|              genres|
 +---------+---------+--------------------+--------------------+-------+---------+-------+--------------+--------------------+
 |tt0000001|    short|          Carmencita|          Carmencita|      0|     1894|     \N|             1|   Documentary,Short|
 |tt0000002|    short|Le clown et ses c...|Le clown et ses c...|      0|     1892|     \N|             5|     Animation,Short|
 |tt0000003|    short|      Pauvre Pierrot|      Pauvre Pierrot|      0|     1892|     \N|             4|Animation,Comedy,...|
 +---------+---------+--------------------+--------------------+-------+---------+-------+--------------+--------------------+
 only showing top 3 rows

Now we can check which year had produced the most works by running this query.

 scala> df.filter($"startYear".rlike("^[0-9]+$")).groupBy("startYear").count().orderBy(desc("count")).show()
 +---------+------+                                                              
 |startYear| count|
 +---------+------+
 |     2018|408237|
 |     2017|405951|
 |     2019|389812|
 |     2016|385868|
 |     2015|364535|
 |     2020|357613|
 |     2021|348330|
 |     2014|347493|
 |     2013|327611|
 |     2012|306525|
 |     2011|269609|
 |     2010|238265|
 |     2009|209788|
 |     2008|198017|
 |     2007|183607|
 |     2006|165042|
 |     2005|148430|
 |     2004|134854|
 |     2003|116679|
 |     2002|104397|
 +---------+------+
 only showing top 20 rows

As the query above, because the year has abnormal values included so we have to use a regex to filter out them.

The types seem not too many:

 scala> df.groupBy("titleType").count().orderBy(desc("count")).show()
 +------------+-------+                                                          
 |   titleType|  count|
 +------------+-------+
 |   tvEpisode|6378471|
 |       short| 847948|
 |       movie| 597375|
 |       video| 255943|
 |    tvSeries| 218363|
 |     tvMovie| 134189|
 |tvMiniSeries|  41262|
 |   tvSpecial|  35370|
 |   videoGame|  29999|
 |     tvShort|  10424|
 |     tvPilot|      2|
 +------------+-------+

We can query how many adult works were made.

 scala> df.filter($"isAdult".isin("0","1")).groupBy("isAdult").count().show()
 +-------+-------+                                                               
 |isAdult|  count|
 +-------+-------+
 |      0|8284336|
 |      1| 263411|
 +-------+-------+

Data is fun. Just try yourself.

Return to home | Generated on 09/29/22