Analyzing Big Data with Hive
Hive
Hive targets structured data and is queried with an SQL-like language (HiveQL).
To load data into a Hive table:
LOAD DATA INPATH 'dbname/userinfo' INTO TABLE MOVIES;
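A fuller sketch of the same load. The MOVIES column layout and delimiter here are illustrative assumptions, not from the original notes; note that LOAD DATA INPATH moves the HDFS file into the table's warehouse directory (add LOCAL to load from the local filesystem instead):

```sql
-- hypothetical schema for the userinfo file
CREATE TABLE MOVIES (
  userid INT,
  name   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- the source path is an HDFS path; the file is moved, not copied
LOAD DATA INPATH 'dbname/userinfo' INTO TABLE MOVIES;
```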
Pig
Pig handles semi-structured and unstructured data. Here the two result sets are captured into two relations (sdppig, bvoippig):
grunt> sdppig = LOAD '/user/remsrch/sdp_daily.csv' USING PigStorage(',') AS (arrivaldate:chararray, status:int, hrsopen:int, closeddate:int, summary:chararray);
grunt> DUMP sdppig;
grunt> bvoippig = LOAD '/user/remsrch/bvoip_daily.csv' USING PigStorage(',') AS (arrivaldate:chararray, status:int, hrsopen:int, closeddate:int, summary:chararray);
grunt> DUMP bvoippig;
grunt> sdphigh = FOREACH sdppig GENERATE arrivaldate, status;
grunt> bvoiphigh = FOREACH bvoippig GENERATE arrivaldate, status;
grunt> combined = JOIN sdphigh BY arrivaldate, bvoiphigh BY arrivaldate;
grunt> DUMP combined;
grunt> STORE combined INTO '/user/remsrch' USING PigStorage(','); -- stores the combined result in HDFS
grunt>quit;
hdfs dfs -ls /user/remsrch
HBase - NoSQL Database
$ hbase shell
create 'tablename','columnfamily',...                     # creates a table with one or more column families
put 'tablename','rowkey','columnfamily:column','value'    # inserts data into a row and column
get 'tablename','rowkey'                                  # queries a single row
scan 'tablename'                                          # displays all rows and columns
disable 'tablename'                                       # a table must be disabled before it can be dropped
drop 'tablename'                                          # drops the table
Pig can help us move data from HDFS into HBase.
app_stock is the table name:
scan 'app_stock', {LIMIT => 10}    # displays the first 10 rows of the table
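The HDFS-to-HBase move mentioned above can be sketched in Pig. The file path, column layout, and the app_stock column family (cf) are illustrative assumptions; the first field of each row becomes the HBase row key:

```pig
-- load stock rows from HDFS; symbol becomes the row key
stock = LOAD '/user/remsrch/app_stock.csv' USING PigStorage(',')
        AS (symbol:chararray, price:chararray, volume:chararray);

-- store the remaining fields into the listed HBase columns
STORE stock INTO 'hbase://app_stock'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:price cf:volume');
```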
Ingesting means pushing files into HDFS. Automate all the repetitive manual file pushes into HDFS with bash scripting.
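One way to sketch that automation in bash. The paths, the *.csv pattern, and the DRY_RUN switch are illustrative assumptions, not part of the original setup:

```shell
#!/usr/bin/env bash
# Push every CSV from a local landing directory into HDFS.
set -euo pipefail

# run: execute a command, or just print it when DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# ingest <local_dir> <hdfs_dir>: put each *.csv into HDFS, then mark it done
ingest() {
  local local_dir="$1" hdfs_dir="$2" f
  for f in "$local_dir"/*.csv; do
    [ -e "$f" ] || continue                  # no files matched the glob
    run hdfs dfs -put -f "$f" "$hdfs_dir/"   # -f overwrites an existing copy
    run mv "$f" "$f.ingested"                # avoid pushing the same file twice
  done
}
```

Scheduled from cron (e.g. `ingest /data/landing /user/remsrch`), this replaces running `hdfs dfs -put` by hand for each file.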
Performance Improvements
Tez - a DAG-based execution engine that speeds up Hive queries through parallelism:
set hive.execution.engine=tez;
ORC - a columnar file format that lays data out in neighboring blocks, enabling predicate pushdown and sorted storage:
set hive.optimize.ppd=true;
set hive.enforce.sorting=true;
CREATE TABLE mytable (
...
) STORED AS orc;
Vectorization - processes a batch of rows at a time instead of row by row, speeding up scans, aggregations, filters, and joins:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimization (CBO) - considers the estimated cost of alternative query plans and picks the cheapest one:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
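CBO can only cost plans from up-to-date statistics, so gather them first (using the mytable example from above):

```sql
-- gather table-level, then column-level, statistics for the optimizer
ANALYZE TABLE mytable COMPUTE STATISTICS;
ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS;
```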
HDFS Multi-Node Cluster
Hadoop Master: 192.168.1.15 (hadoop-master)
Hadoop Slave: 192.168.1.16 (hadoop-slave-1)
Hadoop Slave: 192.168.1.17 (hadoop-slave-2)
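A minimal sketch of the matching configuration, assuming the standard Hadoop layout (file locations are illustrative):

```
# /etc/hosts on every node
192.168.1.15  hadoop-master
192.168.1.16  hadoop-slave-1
192.168.1.17  hadoop-slave-2

# $HADOOP_HOME/etc/hadoop/workers on the master
# (the file is named "slaves" in Hadoop 2.x)
hadoop-slave-1
hadoop-slave-2
```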