Analyzing Big Data with Hive
Hive
Hive targets structured data and is queried with an SQL-like language (HiveQL).
To load data into a Hive table:
LOAD DATA INPATH 'dbname/userinfo' INTO TABLE MOVIES;
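A fuller sketch of the same load. The MOVIES column layout and delimiter here are illustrative assumptions, not from the original notes; note that LOAD DATA INPATH moves the HDFS file into the table's warehouse directory (add LOCAL to load from the local filesystem instead):

```sql
-- hypothetical schema for the userinfo file
CREATE TABLE MOVIES (
  userid INT,
  name   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- the source path is an HDFS path; the file is moved, not copied
LOAD DATA INPATH 'dbname/userinfo' INTO TABLE MOVIES;
```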
Pig
Pig handles semi-structured and unstructured data. Here the two result sets are captured into two relations (sdppig, bvoippig):
grunt> sdppig = LOAD '/user/remsrch/sdp_daily.csv' USING PigStorage(',') AS (arrivaldate:chararray, status:int, hrsopen:int, closeddate:int, summary:chararray);
grunt> DUMP sdppig;
grunt> bvoippig = LOAD '/user/remsrch/bvoip_daily.csv' USING PigStorage(',') AS (arrivaldate:chararray, status:int, hrsopen:int, closeddate:int, summary:chararray);
grunt> DUMP bvoippig;
grunt> sdphigh = FOREACH sdppig GENERATE arrivaldate, status;
grunt> bvoiphigh = FOREACH bvoippig GENERATE arrivaldate, status;
grunt> combined = JOIN sdphigh BY arrivaldate, bvoiphigh BY arrivaldate;
grunt> DUMP combined;
grunt> STORE combined INTO '/user/remsrch' USING PigStorage(','); -- stores the combined result in HDFS
grunt>quit;
hdfs dfs -ls /user/remsrch
HBase - NoSQL Database
$ hbase shell
create 'tablename','columnfamily',...                     # creates a table with one or more column families
put 'tablename','rowkey','columnfamily:column','value'    # inserts data into a row and column
get 'tablename','rowkey'                                  # queries a single row
scan 'tablename'                                          # displays all rows and columns
disable 'tablename'                                       # a table must be disabled before it can be dropped
drop 'tablename'                                          # drops the table
Pig can help us move data from HDFS into HBase.
app_stock is the table name:
scan 'app_stock', {LIMIT => 10}    # displays the first 10 rows of the table
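The HDFS-to-HBase move mentioned above can be sketched in Pig. The file path, column layout, and the app_stock column family (cf) are illustrative assumptions; the first field of each row becomes the HBase row key:

```pig
-- load stock rows from HDFS; symbol becomes the row key
stock = LOAD '/user/remsrch/app_stock.csv' USING PigStorage(',')
        AS (symbol:chararray, price:chararray, volume:chararray);

-- store the remaining fields into the listed HBase columns
STORE stock INTO 'hbase://app_stock'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:price cf:volume');
```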
Ingesting means pushing files into HDFS. Automate all the repetitive manual file pushes into HDFS with bash scripting.
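One way to sketch that automation in bash. The paths, the *.csv pattern, and the DRY_RUN switch are illustrative assumptions, not part of the original setup:

```shell
#!/usr/bin/env bash
# Push every CSV from a local landing directory into HDFS.
set -euo pipefail

# run: execute a command, or just print it when DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# ingest <local_dir> <hdfs_dir>: put each *.csv into HDFS, then mark it done
ingest() {
  local local_dir="$1" hdfs_dir="$2" f
  for f in "$local_dir"/*.csv; do
    [ -e "$f" ] || continue                  # no files matched the glob
    run hdfs dfs -put -f "$f" "$hdfs_dir/"   # -f overwrites an existing copy
    run mv "$f" "$f.ingested"                # avoid pushing the same file twice
  done
}
```

Scheduled from cron (e.g. `ingest /data/landing /user/remsrch`), this replaces running `hdfs dfs -put` by hand for each file.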
Performance Improvements
Tez - a DAG-based execution engine that speeds up Hive queries through parallelism:
set hive.execution.engine=tez;
ORC - a columnar file format that lays data out in neighboring blocks, enabling predicate pushdown and sorted storage:
set hive.optimize.ppd=true;
set hive.enforce.sorting=true;
CREATE TABLE mytable (
...
) STORED AS orc;
Vectorization - processes a batch of rows at a time instead of row by row, speeding up scans, aggregations, filters, and joins:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimization (CBO) - considers the estimated cost of alternative query plans and picks the cheapest one:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
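CBO can only cost plans from up-to-date statistics, so gather them first (using the mytable example from above):

```sql
-- gather table-level, then column-level, statistics for the optimizer
ANALYZE TABLE mytable COMPUTE STATISTICS;
ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS;
```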
HDFS Multi-Node Cluster
Hadoop Master: 192.168.1.15 (hadoop-master)
Hadoop Slave: 192.168.1.16 (hadoop-slave-1)
Hadoop Slave: 192.168.1.17 (hadoop-slave-2)
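A minimal sketch of the matching configuration, assuming the standard Hadoop layout (file locations are illustrative):

```
# /etc/hosts on every node
192.168.1.15  hadoop-master
192.168.1.16  hadoop-slave-1
192.168.1.17  hadoop-slave-2

# $HADOOP_HOME/etc/hadoop/workers on the master
# (the file is named "slaves" in Hadoop 2.x)
hadoop-slave-1
hadoop-slave-2
```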