Here is the CSV dataset, which includes the following columns:

 word_id,word,up_votes,down_votes,author,definition

The dataset has over 2.5 million rows.

Now I load the data into Spark and check its size.

 scala> val df = spark.read.format("csv").option("inferSchema",true).option("header",true).load("skydrive/urbanwords.csv")
val df: org.apache.spark.sql.DataFrame = [word_id: int, word: string ... 4 more fields]

scala> df.count()
val res0: Long = 2580925

scala> df.rdd.getNumPartitions
val res1: Int = 5
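
As an aside, "inferSchema" forces an extra pass over the 2.5-million-row file. If we already knew the column types, we could declare the schema up front instead; under the default PERMISSIVE mode, values that fail to parse simply become null. A minimal sketch (dfTyped is a name I introduce here); in this post I keep the inferred schema, since the string columns let me inspect the bad values first:

 scala> import org.apache.spark.sql.types._
scala> val schema = StructType(Seq(
     |   StructField("word_id", IntegerType, true),
     |   StructField("word", StringType, true),
     |   StructField("up_votes", IntegerType, true),
     |   StructField("down_votes", IntegerType, true),
     |   StructField("author", StringType, true),
     |   StructField("definition", StringType, true)))
scala> val dfTyped = spark.read.format("csv").option("header", true).schema(schema).load("skydrive/urbanwords.csv")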

Then I print its schema as follows.

 scala> df.printSchema
root
|-- word_id: integer (nullable = true)
|-- word: string (nullable = true)
|-- up_votes: string (nullable = true)
|-- down_votes: string (nullable = true)
|-- author: string (nullable = true)
|-- definition: string (nullable = true)

There must be abnormal values in the "up_votes" and "down_votes" columns: I expect them to be integers, but the inferred types are strings, which means at least some rows contain non-numeric values.

So I have to filter out those abnormal rows and cast the relevant columns to the correct types.

I use a regex to keep only the rows whose vote columns are purely numeric, which gives a new DataFrame.

 scala> val df2 = df.filter($"up_votes".rlike("^[0-9]+$")).filter($"down_votes".rlike("^[0-9]+$"))
val df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [word_id: int, word: string ... 4 more fields]

As you can see above, "^[0-9]+$" is a regex that matches purely numeric strings. The filtered dataset is stored in the DataFrame df2.
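
Before discarding the abnormal rows for good, it can be worth peeking at what they actually look like. A quick sketch, inverting the same regex:

 scala> df.filter(!$"up_votes".rlike("^[0-9]+$") || !$"down_votes".rlike("^[0-9]+$")).show(5, false)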

Next I cast the columns in df2 to the correct types as follows.

 scala> val df3 = df2.withColumn("up_votes",$"up_votes".cast("int")).withColumn("down_votes",$"down_votes".cast("int"))
val df3: org.apache.spark.sql.DataFrame = [word_id: int, word: string ... 4 more fields]

scala> df3.printSchema
root
|-- word_id: integer (nullable = true)
|-- word: string (nullable = true)
|-- up_votes: integer (nullable = true)
|-- down_votes: integer (nullable = true)
|-- author: string (nullable = true)
|-- definition: string (nullable = true)

Now both "up_votes" and "down_votes" have the correct integer type, so I can run the statistics I want.
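
As a side note, the cast alone could do the cleanup: casting a non-numeric string to "int" yields null, so dropping nulls in those columns is an alternative to the regex filter. It is not strictly equivalent (the cast also accepts, e.g., signed values), but a rough sketch would be (df3b is my own name here):

 scala> val df3b = df.withColumn("up_votes", $"up_votes".cast("int")).withColumn("down_votes", $"down_votes".cast("int")).na.drop(Seq("up_votes", "down_votes"))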

For instance, we can preview the dataset.

 scala> df3.show(3,false)
+-------+--------+--------+----------+--------+----------------------------------------------------------------+
|word_id|word |up_votes|down_votes|author |definition |
+-------+--------+--------+----------+--------+----------------------------------------------------------------+
|7 |Janky |296 |255 |dc397b2f|Undesirable; less-than optimum. |
|8 |slumpin'|16 |37 |dc397b2f|low down and funky, but [knee deep] enough to ride to. |
|9 |yayeeyay|19 |27 |dc397b2f|affirmation; suggestion of encouragement, approval, or interest.|
+-------+--------+--------+----------+--------+----------------------------------------------------------------+
only showing top 3 rows

scala> df3.count()
val res50: Long = 2579167

scala> df3.select("word").distinct().count()
val res51: Long = 1741732
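
Comparing the two counts shows what the cleaning cost: 2580925 - 2579167 = 1758 rows were dropped by the regex filter. We can compute that directly:

 scala> df.count() - df3.count()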

To get the maximum up_votes per word:

 scala> df3.groupBy("word").agg(max("up_votes").alias("max_up")).orderBy(desc("max_up")).show()
+---------+------+
| word|max_up|
+---------+------+
| sex|289002|
|cartossin|270206|
| hipster|182625|
| nigger|110245|
| slut| 94894|
| love| 94664|
| woody| 91820|
|Gate Rape| 79211|
| Pussy| 73146|
| Emo| 70992|
| cunt| 67451|
| fuck| 66250|
| Penis| 60686|
| nigga| 59844|
| SWAG| 59732|
| Blowjob| 56248|
| muslim| 54053|
| feminist| 53773|
| vagina| 53368|
| emo| 49492|
+---------+------+
only showing top 20 rows
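
Note that the table mixes case variants such as "Emo" and "emo". If we wanted to merge them, we could group on a lower-cased word instead, roughly like this:

 scala> df3.groupBy(lower($"word").alias("word")).agg(max("up_votes").alias("max_up")).orderBy(desc("max_up")).show()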

To get the maximum down_votes per word:

 scala> df3.groupBy("word").agg(max("down_votes").alias("max_down")).orderBy(desc("max_down")).show()
+--------------+--------+
| word|max_down|
+--------------+--------+
| cartossin| 100377|
| sex| 79134|
| hipster| 42696|
| Sex| 32887|
| Penis| 31492|
| SWAG| 30924|
| Emo| 27893|
| Pussy| 27388|
| Chode| 27349|
| yolo| 25687|
| nigger| 25614|
| emo| 25112|
| Blowjob| 24977|
| Donkey Punch| 24133|
| love| 23548|
|FEED THE BIRDS| 23413|
| Prison Money| 23003|
| FTB| 22753|
| Browncoated| 22340|
| Geek Jizzed| 22283|
+--------------+--------+
only showing top 20 rows
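
The same pattern extends to other statistics. For example, here is a rough sketch of a "net score" ranking, summing all votes per word (the column names total_up, total_down, and net are my own):

 scala> df3.groupBy("word").agg(sum("up_votes").alias("total_up"), sum("down_votes").alias("total_down")).withColumn("net", $"total_up" - $"total_down").orderBy(desc("net")).show()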

All done. The results are fun to explore.
