I wrote a script to generate a random dataset for a Hive table.

The table definition:

 > create table books (
 >   id int,
 >   name string,
 >   lang struct<eng:int,chn:int>,
 >   types array<string>,
 >   price float)
 > ROW FORMAT DELIMITED
 > FIELDS TERMINATED BY '|'
 > COLLECTION ITEMS TERMINATED BY ',';
 OK
 Time taken: 0.253 seconds

The sample dataset:

 1|core golang|1,0|AI,development|84.02
 2|mastering cgi|1,0|development,IoT|89.03
 3|mastering scala|0,1|database,networking|35.80
 4|mastering lisp|1,0|cloud,cloud|19.15
 5|24 days for ceph|1,0|cloud,mobile|6.04
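The delimiters line up with the table definition: `|` separates columns, and `,` separates both the struct fields and the array items. Here is a minimal Python sketch of how one row decodes (illustrative only; Hive's SerDe does this for you when the table is queried):

```python
# Decode one row of books.txt according to the table schema.
# '|' is the field separator, ',' the collection item separator.
def parse_row(line):
    book_id, name, lang, types, price = line.rstrip("\n").split("|")
    eng, chn = lang.split(",")
    return {
        "id": int(book_id),
        "name": name,
        "lang": {"eng": int(eng), "chn": int(chn)},  # struct<eng:int,chn:int>
        "types": types.split(","),                   # array<string>
        "price": float(price),
    }

row = parse_row("3|mastering scala|0,1|database,networking|35.80")
print(row["lang"])   # {'eng': 0, 'chn': 1}
print(row["types"])  # ['database', 'networking']
```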

The data is stored in the text file "books.txt". We put this file into HDFS so Hive can ingest it:

 $ hdfs dfs -put books.txt /tmp/test/

Then in the Hive shell we can load the data. Note that "load data inpath" moves the file from /tmp/test/ into the table's warehouse directory rather than copying it:

 > load data inpath "/tmp/test/books.txt" overwrite into table books;
Loading data to table default.books
OK
Time taken: 0.805 seconds
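Before querying, it can be useful to sanity-check the row format locally. A tiny illustrative Python check (the sample rows are inlined here; point `lines` at `open("books.txt")` to check the real file):

```python
# Verify each row has exactly five '|'-separated columns,
# matching the five columns of the books table.
lines = [
    "1|core golang|1,0|AI,development|84.02",
    "2|mastering cgi|1,0|development,IoT|89.03",
    "4|mastering lisp|1,0|cloud,cloud|19.15",
]
rows = [line.split("|") for line in lines]
assert all(len(r) == 5 for r in rows)   # five columns per row
print(len({r[1] for r in rows}))        # distinct names (column 2) -> 3
```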

Run a query like the one below:

 > select count(distinct(name)) from books;
OK
91
Time taken: 2.254 seconds, Fetched: 1 row(s)
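The count is 91 rather than 100 because names are drawn with replacement from only 10 prefixes × 40 topics = 400 combinations, so some of the 100 generated rows collide on the same name. The expected number of distinct names is easy to estimate (a back-of-the-envelope calculation, not part of the original script):

```python
# Expected distinct values when drawing n times uniformly from k items:
#   E[distinct] = k * (1 - (1 - 1/k)**n)
k = 10 * 40   # 10 prefixes in prep.txt x 40 topics in langs.txt
n = 100       # rows generated
expected = k * (1 - (1 - 1 / k) ** n)
print(round(expected, 1))  # ~88.6, so a count around 91 is unsurprising
```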

How was the dataset generated? With the Perl script below (save it as, say, gen.pl and run `perl gen.pl > books.txt`; the filename is just an example):

 use strict;
 use warnings;

 # Build a random book name: a random prefix from prep.txt
 # plus a random topic from langs.txt, e.g. "mastering scala".
 sub gen_name {
     open my $fh, '<', 'langs.txt' or die $!;
     my @langs = <$fh>;
     close $fh;
     open $fh, '<', 'prep.txt' or die $!;
     my @preps = <$fh>;
     close $fh;
     my $prep = $preps[int(rand @preps)];
     my $lang = $langs[int(rand @langs)];
     chomp($prep, $lang);
     return "$prep $lang";
 }

 # Pick two random types (duplicates are possible, as in row 4 above).
 sub gen_types {
     my @types = qw(programming networking engineering systems IoT
                    database mobile development cloud AI web IT);
     return join ',', map { $types[int(rand @types)] } 1 .. 2;
 }

 for my $id (1 .. 100) {
     my $name  = gen_name();
     my $lang  = $id % 3 ? "1,0" : "0,1";   # struct<eng,chn>: two thirds English
     my $types = gen_types();
     my $price = sprintf "%.2f", rand 99;
     print "$id|$name|$lang|$types|$price\n";
 }

And the sample input files:

 $ cat langs.txt 
perl
python
erlang
julia
rust
ruby
c
c++
lisp
r-lang
haskell
java
scala
raku
cgi
php
kotlin
golang
tcp/ip
big-data
cloud-computing
distributed-storage
machine-learning
ceph
openstack
DNS
postfix
dovecot
k8s
heroku
pytorch
aws
google-cloud
rabbitmq
zeromq
mysql
postgres
tensorflow
computer-vision
deep-learning

$ cat prep.txt
learning
practical
action for
introduce to
dive into
24 days for
a guide to
beginning
core
mastering

Good luck with it.

Generated on 09/29/22