Sunday, 20 November 2016

Hadoop in One Day



Analyzing Big Data with Hive

Hive
Hive handles structured data using an SQL-like query language (HiveQL).


To load data into a Hive table: LOAD DATA INPATH 'dbname/userinfo' INTO TABLE MOVIES;
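A fuller sketch of the round trip (the userinfo table, its columns, and the query are assumptions for illustration):

CREATE TABLE userinfo (
  userid INT,
  name   STRING,
  city   STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH 'dbname/userinfo' INTO TABLE userinfo;

SELECT city, COUNT(*) FROM userinfo GROUP BY city;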


Pig
Pig handles semi-structured and unstructured data. In the example below, the two result sets are captured into two relations (sdppig, bvoippig).
grunt> sdppig = LOAD '/user/remsrch/sdp_daily.csv' USING PigStorage(',') AS (arrivaldate:chararray,status:int,hrsopen:int,closeddate:int,summary:chararray);
grunt>DUMP sdppig;
grunt> bvoippig = LOAD '/user/remsrch/bvoip_daily.csv' USING PigStorage(',') AS (arrivaldate:chararray,status:int,hrsopen:int,closeddate:int,summary:chararray);
grunt> DUMP bvoippig;
grunt> sdphigh = FOREACH sdppig GENERATE arrivaldate,status;
grunt> bvoiphigh = FOREACH bvoippig GENERATE arrivaldate,status;
grunt> combined = JOIN sdphigh BY arrivaldate, bvoiphigh BY arrivaldate;
grunt>DUMP combined;
grunt> STORE combined INTO '/user/remsrch/combined' USING PigStorage(','); -- store the joined result back into HDFS (the output directory must not already exist)
grunt>quit;
hdfs dfs -ls /user/remsrch

HBase: NoSQL Database
$ hbase shell
create 'tablename','columnfamily',...   # creates a table; arguments after the name are column families
put 'tablename','rowkey','columnfamily:qualifier','value'   # inserts a value into a row/column
get 'tablename','rowkey'   # queries a single row
scan 'tablename'   # displays all rows and columns
disable 'tablename'   # a table must be disabled before it can be dropped
drop 'tablename'   # drops the table
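A concrete session might look like this (the app_stock table, its info column family, and the values are assumptions for illustration):

hbase> create 'app_stock','info'
hbase> put 'app_stock','row1','info:symbol','T'
hbase> put 'app_stock','row1','info:price','38.20'
hbase> get 'app_stock','row1'
hbase> scan 'app_stock'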
Pig can also help move data from HDFS into HBase, as sketched below; app_stock is the target table name.
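A minimal sketch using Pig's built-in HBaseStorage loader (the input file, its fields, and the info column family are assumptions; the first field of the relation becomes the HBase row key):

grunt> stock = LOAD '/user/remsrch/stock.csv' USING PigStorage(',') AS (rowkey:chararray,symbol:chararray,price:float);
grunt> STORE stock INTO 'hbase://app_stock' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:symbol info:price');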

scan 'app_stock', {LIMIT => 10}   # displays the first 10 rows of the table

Ingesting means pushing files into HDFS. All the manual file pushes can be automated with a bash script, as sketched below.
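A minimal sketch, assuming a local landing directory and an HDFS target path (all three directory names are hypothetical):

#!/bin/bash
# push every CSV from a local landing directory into HDFS,
# then move the local copy into an archive directory
LANDING=/data/landing              # hypothetical local directory
ARCHIVE=/data/landing/archive      # hypothetical local archive
TARGET=/user/remsrch/incoming      # hypothetical HDFS directory

for f in "$LANDING"/*.csv; do
  [ -e "$f" ] || continue                   # skip if no files match
  if hdfs dfs -put "$f" "$TARGET"/; then    # ingest into HDFS
    mv "$f" "$ARCHIVE"/
  fi
done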

Performance Improvements
Tez – a high-speed execution engine that runs a Hive query as a single parallel DAG instead of a chain of MapReduce jobs
set hive.execution.engine=tez;
ORC – a columnar file format that lays related data out in neighboring blocks and keeps lightweight indexes, enabling predicate pushdown
set hive.optimize.ppd=true;
set hive.enforce.sorting=true;
CREATE TABLE mytable (
...
) STORED AS orc;
Vectorization – processes rows in batches of 1024 to speed up scans, aggregations, filters, and joins
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimizer (CBO) – estimates the cost of candidate query plans and picks the cheapest one
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
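The stats-driven settings above only pay off if statistics have actually been gathered; a minimal sketch, reusing the hypothetical mytable from the ORC example:

ANALYZE TABLE mytable COMPUTE STATISTICS;
ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS;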

HDFS
Multi-Node Cluster
Hadoop Master: 192.168.1.15 (hadoop-master)
Hadoop Slave: 192.168.1.16 (hadoop-slave-1)
Hadoop Slave: 192.168.1.17 (hadoop-slave-2)
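To wire the nodes together, every host needs name resolution for the others, and the master needs a list of workers; a minimal sketch, assuming a standard Hadoop 2.x layout:

# /etc/hosts on every node
192.168.1.15 hadoop-master
192.168.1.16 hadoop-slave-1
192.168.1.17 hadoop-slave-2

# $HADOOP_HOME/etc/hadoop/slaves on hadoop-master
hadoop-slave-1
hadoop-slave-2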

