Here is the CSV dataset, which includes the following columns:

 word_id,word,up_votes,down_votes,author,definition

The dataset has over 2.5 million rows.

Now I load the data into Spark and check its size.

 scala> val df = spark.read.format("csv").option("inferSchema",true).option("header",true).load("skydrive/urbanwords.csv")
val df: org.apache.spark.sql.DataFrame = [word_id: int, word: string ... 4 more fields]

scala> df.count()
val res0: Long = 2580925

scala> df.rdd.getNumPartitions
val res1: Int = 5
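
As an aside, "inferSchema" forces an extra pass over the 2.5-million-row file. If we already knew the column types, we could declare the schema up front instead; under the default PERMISSIVE mode, values that fail to parse simply become null. A minimal sketch (dfTyped is a name I introduce here); in this post I keep the inferred schema, since the string columns let me inspect the bad values first:

 scala> import org.apache.spark.sql.types._
scala> val schema = StructType(Seq(
     |   StructField("word_id", IntegerType, true),
     |   StructField("word", StringType, true),
     |   StructField("up_votes", IntegerType, true),
     |   StructField("down_votes", IntegerType, true),
     |   StructField("author", StringType, true),
     |   StructField("definition", StringType, true)))
scala> val dfTyped = spark.read.format("csv").option("header", true).schema(schema).load("skydrive/urbanwords.csv")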

Then I print its schema as follows.

 scala> df.printSchema
root
|-- word_id: integer (nullable = true)
|-- word: string (nullable = true)
|-- up_votes: string (nullable = true)
|-- down_votes: string (nullable = true)
|-- author: string (nullable = true)
|-- definition: string (nullable = true)

There must be abnormal values in the "up_votes" and "down_votes" columns: I expect them to be integers, but the inferred types are strings, which means at least some rows contain non-numeric values.

So I have to filter out those abnormal rows and cast the relevant columns to the correct types.

I use a regex to keep only the rows whose vote columns are purely numeric, which gives a new DataFrame.

 scala> val df2 = df.filter($"up_votes".rlike("^[0-9]+$")).filter($"down_votes".rlike("^[0-9]+$"))
val df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [word_id: int, word: string ... 4 more fields]

As you can see above, "^[0-9]+$" is a regex that matches purely numeric strings. The filtered dataset is stored in the DataFrame df2.
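
Before discarding the abnormal rows for good, it can be worth peeking at what they actually look like. A quick sketch, inverting the same regex:

 scala> df.filter(!$"up_votes".rlike("^[0-9]+$") || !$"down_votes".rlike("^[0-9]+$")).show(5, false)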

Next I cast the columns in df2 to the correct types as follows.

 scala> val df3 = df2.withColumn("up_votes",$"up_votes".cast("int")).withColumn("down_votes",$"down_votes".cast("int"))
val df3: org.apache.spark.sql.DataFrame = [word_id: int, word: string ... 4 more fields]

scala> df3.printSchema
root
|-- word_id: integer (nullable = true)
|-- word: string (nullable = true)
|-- up_votes: integer (nullable = true)
|-- down_votes: integer (nullable = true)
|-- author: string (nullable = true)
|-- definition: string (nullable = true)

Now both "up_votes" and "down_votes" have the correct integer type, so I can run the statistics I want.
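
As a side note, the cast alone could do the cleanup: casting a non-numeric string to "int" yields null, so dropping nulls in those columns is an alternative to the regex filter. It is not strictly equivalent (the cast also accepts, e.g., signed values), but a rough sketch would be (df3b is my own name here):

 scala> val df3b = df.withColumn("up_votes", $"up_votes".cast("int")).withColumn("down_votes", $"down_votes".cast("int")).na.drop(Seq("up_votes", "down_votes"))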

For instance, we can preview the dataset.

 scala> df3.show(3,false)
+-------+--------+--------+----------+--------+----------------------------------------------------------------+
|word_id|word |up_votes|down_votes|author |definition |
+-------+--------+--------+----------+--------+----------------------------------------------------------------+
|7 |Janky |296 |255 |dc397b2f|Undesirable; less-than optimum. |
|8 |slumpin'|16 |37 |dc397b2f|low down and funky, but [knee deep] enough to ride to. |
|9 |yayeeyay|19 |27 |dc397b2f|affirmation; suggestion of encouragement, approval, or interest.|
+-------+--------+--------+----------+--------+----------------------------------------------------------------+
only showing top 3 rows

scala> df3.count()
val res50: Long = 2579167

scala> df3.select("word").distinct().count()
val res51: Long = 1741732
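
Comparing the two counts shows what the cleaning cost: 2580925 - 2579167 = 1758 rows were dropped by the regex filter. We can compute that directly:

 scala> df.count() - df3.count()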

To get the maximum up_votes per word:

 scala> df3.groupBy("word").agg(max("up_votes").alias("max_up")).orderBy(desc("max_up")).show()
+---------+------+
| word|max_up|
+---------+------+
| sex|289002|
|cartossin|270206|
| hipster|182625|
| nigger|110245|
| slut| 94894|
| love| 94664|
| woody| 91820|
|Gate Rape| 79211|
| Pussy| 73146|
| Emo| 70992|
| cunt| 67451|
| fuck| 66250|
| Penis| 60686|
| nigga| 59844|
| SWAG| 59732|
| Blowjob| 56248|
| muslim| 54053|
| feminist| 53773|
| vagina| 53368|
| emo| 49492|
+---------+------+
only showing top 20 rows
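
Note that the table mixes case variants such as "Emo" and "emo". If we wanted to merge them, we could group on a lower-cased word instead, roughly like this:

 scala> df3.groupBy(lower($"word").alias("word")).agg(max("up_votes").alias("max_up")).orderBy(desc("max_up")).show()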

To get the maximum down_votes per word:

 scala> df3.groupBy("word").agg(max("down_votes").alias("max_down")).orderBy(desc("max_down")).show()
+--------------+--------+
| word|max_down|
+--------------+--------+
| cartossin| 100377|
| sex| 79134|
| hipster| 42696|
| Sex| 32887|
| Penis| 31492|
| SWAG| 30924|
| Emo| 27893|
| Pussy| 27388|
| Chode| 27349|
| yolo| 25687|
| nigger| 25614|
| emo| 25112|
| Blowjob| 24977|
| Donkey Punch| 24133|
| love| 23548|
|FEED THE BIRDS| 23413|
| Prison Money| 23003|
| FTB| 22753|
| Browncoated| 22340|
| Geek Jizzed| 22283|
+--------------+--------+
only showing top 20 rows
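
The same pattern extends to other statistics. For example, here is a rough sketch of a "net score" ranking, summing all votes per word (the column names total_up, total_down, and net are my own):

 scala> df3.groupBy("word").agg(sum("up_votes").alias("total_up"), sum("down_votes").alias("total_down")).withColumn("net", $"total_up" - $"total_down").orderBy(desc("net")).show()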

All done. The results are fun to explore.
