Script and Spark for a small dataset

A few days ago I ran into a problem using Perl to analyse 80+ million items from a CSV file.

The job ran out of memory, since I was running it on a VPS with 2 GB of RAM.

Today I reran it on another, smaller dataset with only 5 million items.

This time Perl does the job pretty well. The script is as follows:

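 use strict;
 use warnings;
 my %hash;   # per-job running total and count
 my %stat;   # per-job average salary
 # dataset:
 # Arthur Y.B.,Male,1979-7-10,56607,[email protected],Loan Officer,23331.0
 open my $fh, '<', 'person.csv' or die "Cannot open person.csv: $!";
 while (<$fh>) {
     chomp;
     # the last two fields of each line are the job and the salary
     my ($job, $salary) = (split /,/)[-2, -1];
     $hash{$job}{total} += $salary;
     $hash{$job}{count} += 1;
 }
 close $fh;
 # compute the average salary for each job
 for my $key (keys %hash) {
     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
 }
 # print the 20 jobs with the highest average salary
 my $i = 0;
 for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
     printf "%-40s%.10f\n", $_, $stat{$_};
     last if ++$i == 20;
 }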

And here is the result of running the Perl script:

 $ time perl count.pl 
 Software Developer                      17572.9448866016
 Plumber                                 17572.5436757512
 Recreation & Fitness Worker             17568.1629235022
 Veterinary Technologist & Technician    17567.2669899038
 Occupational Therapist                  17562.4286936453
 Cashier                                 17553.7658357742
 Marketing Manager                       17553.3646477380
 Maid & Housekeeper                      17551.8888711735
 Executive Assistant                     17550.3703990471
 Diagnostic Medical Sonographer          17549.4909587512
 Medical Assistant                       17548.7211987571
 Financial Analyst                       17545.7428859941
 Logistician                             17543.4005038291
 Financial Advisor                       17542.6550936134
 Landscaper & Groundskeeper              17541.8355272385
 Telemarketer                            17538.7797860791
 Sales Manager                           17534.8601334528
 Construction Manager                    17534.3827493797
 Marriage & Family Therapist             17531.9513995878
 Auto Mechanic                           17527.9992342878
 real	0m6.403s
 user	0m6.304s
 sys	0m0.088s

Apache Spark took 7 seconds to run the same job, so Perl is even a bit faster.

 scala> df.groupBy("job").agg(avg("salary").alias("avg_salary")).orderBy(desc("avg_salary")).show(false)
 +------------------------------------+------------------+                       
 |job                                 |avg_salary        |
 +------------------------------------+------------------+
 |Software Developer                  |17572.94488660164 |
 |Plumber                             |17572.54367575122 |
 |Recreation & Fitness Worker         |17568.16292350223 |
 |Veterinary Technologist & Technician|17567.266989903826|
 |Occupational Therapist              |17562.4286936453  |
 |Cashier                             |17553.765835774204|
 |Marketing Manager                   |17553.36464773796 |
 |Maid & Housekeeper                  |17551.88887117347 |
 |Executive Assistant                 |17550.370399047053|
 |Diagnostic Medical Sonographer      |17549.490958751172|
 |Medical Assistant                   |17548.721198757143|
 |Financial Analyst                   |17545.742885994143|
 |Logistician                         |17543.4005038291  |
 |Financial Advisor                   |17542.655093613437|
 |Landscaper & Groundskeeper          |17541.835527238483|
 |Telemarketer                        |17538.779786079052|
 |Sales Manager                       |17534.860133452843|
 |Construction Manager                |17534.382749379653|
 |Marriage & Family Therapist         |17531.951399587826|
 |Auto Mechanic                       |17527.99923428779 |
 +------------------------------------+------------------+
 only showing top 20 rows
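The spark-shell session above does not show how `df` was created. As a rough sketch only, it could be loaded along the following lines, assuming `person.csv` has no header row and seven columns as in the sample line from the Perl script. Only the column names `job` and `salary` come from the query above; the other column names, the object name `AvgSalaryByJob`, and the `local[*]` master are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, desc}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object AvgSalaryByJob {
  // Assumed schema: only "job" and "salary" are known from the query;
  // the first five column names are hypothetical placeholders.
  val personSchema: StructType = StructType(Seq(
    StructField("name", StringType),
    StructField("sex", StringType),
    StructField("birthday", StringType),
    StructField("code", StringType),
    StructField("email", StringType),
    StructField("job", StringType),
    StructField("salary", DoubleType)
  ))

  // Same aggregation as in the spark-shell session: average salary per job,
  // sorted from highest to lowest.
  def avgSalaryByJob(df: DataFrame): DataFrame =
    df.groupBy("job")
      .agg(avg("salary").alias("avg_salary"))
      .orderBy(desc("avg_salary"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avg-salary-by-job")
      .master("local[*]") // assumption: a single-machine run
      .getOrCreate()

    // Read the header-less CSV with the explicit schema above.
    val df = spark.read.schema(personSchema).csv("person.csv")

    // show(false) prints the first 20 rows without truncating long values.
    avgSalaryByJob(df).show(false)

    spark.stop()
  }
}
```

Giving the reader an explicit schema avoids a separate pass over the file to infer column types, which may matter when comparing run times on a small machine.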

When memory is not a constraint, a script can finish this kind of statistics job quickly.

However, most of our online datasets have 100+ million items, and a script is not suitable for analysing data of that size.

Generated on 09/29/22