Today I loaded a CSV file of population data into Spark. When I printed the schema, I noticed an issue.

 scala> df.printSchema
root
 |-- Year: integer (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Ethnic: integer (nullable = true)
 |-- Sex: integer (nullable = true)
 |-- Area: string (nullable = true)
 |-- count: string (nullable = true)

As you can see, the last two columns are string types, but they should be numeric.
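For context, when Spark loads a CSV with schema inference, a column stays string if any of its values fail to parse as a number. The load probably looked something like this sketch (the file name and options are my assumptions, not the original command):

 scala> val df = spark.read.
   option("header", "true").
   option("inferSchema", "true").
   csv("population.csv")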

So I use a regex to keep only the rows where these two columns hold numeric values.

 scala> val regex = """[0-9]+"""
regex: String = [0-9]+
scala> val clean_df = df.filter($"Area".rlike(regex)).filter($"count".rlike(regex))
clean_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Year: int, Age: int ... 4 more fields]

Now the columns "Area" and "count" in the new dataframe should contain numeric values. (Note that rlike() matches substrings, so strictly speaking a row passes as long as the column contains at least one digit; to require fully numeric values, anchor the pattern as """^[0-9]+$""".)
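To verify, a quick check (my addition, not part of the original session) counts the rows that fail the anchored pattern; zero means every remaining value is fully numeric:

 scala> val strict = """^[0-9]+$"""
 scala> clean_df.filter(!($"Area".rlike(strict)) || !($"count".rlike(strict))).count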

Next we can aggregate the relevant columns. Before that, we need to cast the "count" column to int, since the filter left it as a string.

For instance, to get the total population count for each ethnic group, run the statement below.

 scala> clean_df.withColumn("count",$"count".cast("int")).
groupBy("Ethnic").agg(sum("count").alias("total")).
orderBy(desc("total")).show
+------+----------+
|Ethnic|     total|
+------+----------+
|  9999|1072563227|
|    77|1038507333|
|     1| 728167957|
|     2| 158110844|
|     4| 130452499|
|     3|  80360492|
|     6|  45757381|
|    61|  44103959|
|     9|  33142422|
|     5|  12989397|
|    69|   1397717|
+------+----------+

Everything worked as expected.

So with a Scala regex and Spark's rlike() column method, we can extract exactly the rows we want.
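An alternative that avoids regex altogether (a sketch of my own, not the method used above) is to cast the column and drop the rows where the cast fails, since casting a non-numeric string to int yields null in Spark:

 scala> val alt_df = df.
   withColumn("count", $"count".cast("int")).
   filter($"count".isNotNull)

This keeps exactly the rows whose "count" parses as an integer, without needing a pattern.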

This is called "outlier handling", which is widely used in feature engineering.
