It's quite easy to query web server logs with Apache Spark. Here are some sample queries against the default access.log of an Apache httpd server.

For instance, suppose we want to find the top 10 IP addresses that accessed our website.

scala> val rdd = sc.textFile("tmp/access.log")
val rdd: org.apache.spark.rdd.RDD[String] = tmp/access.log MapPartitionsRDD[1] at textFile at <console>:1

scala> rdd.map(x => (x.split("""\s+""")(0),1)).reduceByKey(_+_).sortBy(-_._2).take(10).foreach(println)
(141.101.105.219,656)
(172.70.142.3,263)
(162.158.159.91,47)
(172.68.110.165,47)
(172.70.90.139,47)
(172.70.162.61,47)
(141.101.99.236,47)
(172.70.147.105,28)
(172.70.54.225,24)
(172.70.142.237,16)
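
To see what each step of that one-liner does, here is a minimal sketch of the same pipeline on a plain Scala Seq, so it runs without a Spark context. The sample lines, the documentation-range addresses, and the helper names clientIp and topClients are made up for illustration; groupBy plus a sum stands in for reduceByKey.

```scala
// Hypothetical log lines in Apache common log format; the addresses are
// documentation-range placeholders, not data from the real access.log.
val sampleLines = Seq(
  """203.0.113.7 - - [29/Sep/2022:10:00:00 +0000] "GET / HTTP/1.1" 200 512""",
  """198.51.100.2 - - [29/Sep/2022:10:00:01 +0000] "GET /a HTTP/1.1" 200 128""",
  """203.0.113.7 - - [29/Sep/2022:10:00:02 +0000] "GET /b HTTP/1.1" 404 64"""
)

// The client address is the first whitespace-separated field of a line.
def clientIp(line: String): String = line.split("""\s+""")(0)

// Same shape as map + reduceByKey + sortBy + take, but on a local Seq.
def topClients(lines: Seq[String], n: Int): Seq[(String, Int)] =
  lines
    .map(line => (clientIp(line), 1))                       // (ip, 1) pairs
    .groupBy(_._1)                                          // group pairs by ip
    .map { case (ip, pairs) => (ip, pairs.map(_._2).sum) }  // sum the ones
    .toSeq
    .sortBy(-_._2)                                          // highest count first
    .take(n)

val top = topClients(sampleLines, 10)
top.foreach(println)
```

On the RDD, reduceByKey does the grouping and summing in one distributed step, which is why the Spark version is shorter.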

Next, we want to know the top 5 user agents (UA) that visited our website.

scala> val regex = """.*\"(.*)\"""".r
val regex: scala.util.matching.Regex = .*\"(.*)\"

scala> rdd.map { case regex(bot) => (bot,1) }.reduceByKey(_+_).sortBy( -_._2).take(5).foreach(println)
(Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36,655)
(Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0,351)
(Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4240.193 Safari/537.36,262)
(Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36,188)
(WordPress/5.9.3; https://xxx.com,27)
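
The regex works because the leading greedy .* pushes the capture group to the last quoted field on the line, which is the user agent in a log that ends with that field. One caveat: a `case regex(bot)` function with no fallback throws a scala.MatchError on any line that does not match, such as a blank line. Below is a rough sketch of a more forgiving variant on a plain Scala Seq. The helper names userAgent and combinedLine and the sample data are hypothetical.

```scala
import scala.util.matching.Regex

// Same pattern as in the session above: the capture group ends up on the
// last quoted field of the line.
val lastQuoted: Regex = """.*\"(.*)\"""".r

// Returns None instead of throwing when a line has no trailing quoted field.
def userAgent(line: String): Option[String] = line match {
  case lastQuoted(ua) => Some(ua)
  case _              => None
}

// Hypothetical helper that builds a combined-format line for the demo.
def combinedLine(ip: String, ua: String): String =
  ip + " - - [29/Sep/2022:10:00:00 +0000] \"GET / HTTP/1.1\" 200 512 \"-\" \"" + ua + "\""

val lines = Seq(
  combinedLine("203.0.113.7", "curl/7.81.0"),
  combinedLine("198.51.100.2", "ExampleBot/1.0"),
  combinedLine("203.0.113.9", "curl/7.81.0"),
  "",                       // blank line: skipped
  "not an access log line"  // malformed line: skipped
)

// Count hits per user agent, most frequent first.
val agentCounts: Seq[(String, Int)] =
  lines
    .flatMap(line => userAgent(line).toList)
    .groupBy(identity)
    .map { case (ua, hits) => (ua, hits.size) }
    .toSeq
    .sortBy(-_._2)

agentCounts.foreach(println)
```

In the Spark shell, the same helper could probably be dropped in as rdd.flatMap(line => userAgent(line).toList) before the reduceByKey step, so that unparseable lines are skipped rather than failing the job.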

As long as you know how the web server logs are organized, you can run queries in Spark to get any information you want from them.
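
As an illustration of "knowing how the logs are organized", here is a sketch that parses one line into named fields. It assumes the log uses Apache's combined log format (client, identity, user, timestamp, request, status, bytes, referer, user agent); a server with a custom LogFormat would need a different pattern. The AccessEntry case class, the parse helper, and the sample line are hypothetical.

```scala
// One parsed log line. bytes stays a String because Apache logs "-" when
// no body was sent.
final case class AccessEntry(
  ip: String,
  time: String,
  request: String,
  status: Int,
  bytes: String,
  referer: String,
  userAgent: String
)

// Pattern for the combined log format (an assumption about the log layout).
val combined =
  """(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"""".r

// Returns None for lines that do not follow the assumed layout.
def parse(line: String): Option[AccessEntry] = line match {
  case combined(ip, _, _, time, request, status, bytes, referer, ua) =>
    Some(AccessEntry(ip, time, request, status.toInt, bytes, referer, ua))
  case _ => None
}

// Made-up sample line, not taken from the real access.log.
val sampleLine =
  "203.0.113.7 - - [29/Sep/2022:10:00:00 +0000] \"GET /missing HTTP/1.1\" 404 196 \"-\" \"curl/7.81.0\""

val entry = parse(sampleLine)
entry.foreach(e => println(s"${e.ip} ${e.status} ${e.request}"))
```

With the fields named, a query such as "requests per status code" becomes a matter of mapping each entry to (status, 1) and reducing by key, in the same way as the IP query.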

Generated on 09/29/22