Tuesday, 7 February 2017

Quick steps to fine-tune an Oracle database

First, analyse the ADDM and AWR reports to find the real cause of the database load; the steps below can then be performed.

1.    Purging the Recycle Bin
purge recyclebin; -- purges the dropped objects and releases their storage
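Before purging, it may help to review what is currently sitting in the recycle bin (a quick check, assuming DBA privileges):

select owner, object_name, original_name, type, droptime
from   dba_recyclebin
order  by droptime desc;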

2.    Space Issues

select df.tablespace_name "Tablespace",
       tu.totalusedspace "Used MB",
       (df.totalspace - tu.totalusedspace) "Free MB",
       df.totalspace "Total MB",
       round(100 * ((df.totalspace - tu.totalusedspace) / df.totalspace)) "Pct. Free"
from   (select tablespace_name, round(sum(bytes) / 1048576) totalspace
        from dba_data_files
        group by tablespace_name) df,
       (select tablespace_name, round(sum(bytes) / 1048576) totalusedspace
        from dba_segments
        group by tablespace_name) tu
where  df.tablespace_name = tu.tablespace_name;

Pct. Free -- it should not fall below 10% unless it is a low-priority or unused tablespace
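If a critical tablespace is running low, either resize an existing datafile or add a new one. A minimal sketch, assuming the USERS tablespace and hypothetical datafile paths:

-- Option 1: resize an existing datafile (path is a hypothetical example)
alter database datafile '/u01/oradata/ORCL/users01.dbf' resize 2G;

-- Option 2: add a new datafile to the tablespace
alter tablespace users add datafile '/u01/oradata/ORCL/users02.dbf' size 1G
  autoextend on next 100M maxsize 4G;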

select * from v$sgastat where NAME = 'free memory';

The shared pool, large pool, and java pool should be predominantly large, with enough free memory remaining in each
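To see how the SGA is currently split across the pools (a quick check, not a sizing recommendation):

select pool, round(sum(bytes) / 1048576) size_mb
from   v$sgastat
group  by pool
order  by size_mb desc;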

3.    Check for any anonymous or long-running sessions
select s.sid, s.serial#, s.username, s.osuser from v$session s, v$process p where s.paddr = p.addr;
select * from v$session where osuser!='oracle';

OSUSER should not be an anonymous or ad-hoc user such as a DESKTOP login, or a standalone device such as a laptop or mobile client.
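Long-running work can be spotted from v$session_longops before deciding which session to kill (a quick check; the threshold is a judgment call):

select sid, serial#, opname, sofar, totalwork, time_remaining
from   v$session_longops
where  time_remaining > 0;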

If you identify any long-running or rogue session, go ahead and kill it, with prior notice to the stakeholders:

ALTER SYSTEM KILL SESSION '36, 57056, @1' IMMEDIATE; -- the literal is 'sid, serial#, @inst_id'

4.    Result Cache
SELECT dbms_result_cache.status() FROM dual;
If the status returned is DISABLED, the result cache can be enabled as sketched below.
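A minimal sketch of enabling and using the result cache; the sizes are illustrative assumptions, not recommendations, and the EMPLOYEES table is only an example:

-- Reserve memory for the result cache and keep caching under manual control
alter system set result_cache_max_size = 100M scope = both;
alter system set result_cache_mode = MANUAL scope = both;

-- Opt an individual query into the cache with the RESULT_CACHE hint
select /*+ RESULT_CACHE */ department_id, count(*)
from   employees
group  by department_id;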


5.    SQL trace at the session level should be disabled, as it increases the load on the trace files
ALTER SESSION SET sql_trace = false;

Monday, 6 February 2017

Grammar-Text Analytics - Unstructured Data-Content Management in MySQL

How to use grammar (full-text search) in MySQL.
MySQL has good content-management capabilities and handles bulk text data efficiently.
Assume the data has already been loaded into a MySQL table.

Let us consider a table called CORE_DATA which contains the following columns,
Identifier -- A Unique identifier for the row
SentDate -- Worked date
Content -- Actual text message, Can be a LONGTEXT or MEDIUMTEXT           
Summary -- Header or the topic of the text message

CREATE FULLTEXT INDEX content_idx ON CORE_DATA(content);

MATCH ... AGAINST -- the construct used to search for a string in the full-text indexed column

The query below returns all rows that match the search string, with a priority showing how the match was made:

(select 1 PrioID, summary, content
 from   CORE_DATA
 where  upper(content) like upper(concat('%', 'Deepika Padukone is awesome', '%'))
 limit  100)
-- 1: an exact match of the whole phrase
union
(select 2, summary, content
 from   CORE_DATA
 where  match (content)
        against (replace('Deepika Padukone is awesome', ' ', ' +') in boolean mode))
-- 2: a match with ALL the words present, at different places in the column
union
(select 3, summ, cont
 from   (select summary as summ, content as cont,
                match (content)
                against (replace('Deepika Padukone is awesome', ' ', ',') in boolean mode) as score
         from   CORE_DATA) q1
 -- 3: a partial match with ANY of the words present, ranked by relevance score
 order  by score desc
 limit  100)
order by PrioID;

Result set:
Deepika Padukone is awesome -- Result of 1st Union clause
Deepika last name is Padukone and she really look awesome -- 2nd Union
Deepika acted in Ramleela and looking like an angel and her costumes are really nice --3rd Union

Monday, 26 December 2016

Grammar-Text Analytics-Unstructured Data-Content Management in Oracle


This is not Hadoop, just an example of how to use grammar (text) search in Oracle.
Let us assume that we have loaded the data from various sources into an Oracle database and need to do some pattern matching on it.
This article does not give end-to-end insight, but covers a quick thought on content management and searching with text patterns.
CTXSYS is the Oracle Text schema; it owns the packages (CTX_DDL, CTX_QUERY, and so on) that provide most of the grammatical operations used in content management.
Run the commands below to grant the respective privileges on the CTXSYS packages.



The grants below are to be run from SYS to enable all grammar commands for the TRAINING schema:
GRANT EXECUTE ON CTXSYS.CTX_CLS TO training;
GRANT EXECUTE ON CTXSYS.CTX_DDL TO training;
GRANT EXECUTE ON CTXSYS.CTX_DOC TO training;
GRANT EXECUTE ON CTXSYS.CTX_OUTPUT TO training;
GRANT EXECUTE ON CTXSYS.CTX_QUERY TO training;
GRANT EXECUTE ON CTXSYS.CTX_REPORT TO training;
GRANT EXECUTE ON CTXSYS.CTX_THES TO training;
GRANT EXECUTE ON CTXSYS.CTX_ULEXER TO training;
GRANT EXECUTE ON CTXSYS.CTX_MATCHES TO training;
GRANT EXECUTE ON CTXSYS.DR_DEF TO training;
GRANT EXECUTE ON CTXSYS.DRICON TO training;
GRANT EXECUTE ON CTXSYS.CTX_CATSEARCH TO training;

The query below lists the domain indexes based on CTXSYS index types; these indexes are mandatory to enable any text-based search patterns on the required columns.
select owner, index_name, index_type, status, domidx_status, domidx_opstatus from dba_indexes where ityp_owner = 'CTXSYS';

Let us consider a table called CORE_DATA which contains the following columns,
Identifier -- A Unique identifier for the row
SentDate -- Worked date
Content -- Actual text message, Can be a CLOB or LONG data type                   
Summary -- Header or the topic of the text message

CREATE INDEX TRAINING.EM_CONSTRTIDX ON TRAINING.CORE_DATA (content) INDEXTYPE IS CTXSYS.CONTEXT;
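A CONTEXT index is not maintained transactionally; after DML on CORE_DATA it has to be synchronized before new rows become searchable. A minimal sketch, assuming the index created above:

BEGIN
  ctx_ddl.sync_index('TRAINING.EM_CONSTRTIDX');
END;
/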

CONTAINS -- To check the existence of any keyword inside a text                          
NEAR -- To check that multiple words occur close to each other inside a text

select * from core_data where contains(content, 'wonderful', 1) > 0;
select * from core_data where contains(content, 'deepika near wonderful') > 0;
select * from core_data where contains(content, 'Deepika near Wonderful near looking') > 0;

Result set:
Heroines are wonderful
Deepika looks wonderful
Deepika acted in Ramleela and looking like an angel and her costumes are really wonderful

Achieve 100% Availability on BI reports - Synonym Toggling of ETL

Any BI load manipulates the source data, and during that window the reports may have downtime or serve stale data for a while.
For example, end-of-day ledger postings take roughly 45 minutes to complete the ETL process and load the daily accounting reports.
For those 45 minutes the reports should still be accessible to users, without any downtime or exceptions.
Note :
A few OLAP/BI tools already provide caching or auto-refresh capabilities.
The technique below can be implemented for FOSS-based ETL/reporting tools where such features are limited.

Synonym Toggling:

Assume MAIN_SYN (a synonym) is the most important source object through which the DIM, FACT, and SUMMARY tables get refreshed.

You can apply the concept to any number of tables.
STAGING_TEMP1 and STAGING_TEMP2 are identical copies of the staging table, and the synonym MAIN_SYN points to one of them at a time, as sketched below.
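A minimal sketch of the toggle, using the names above. The ETL loads the offline copy while reports keep reading the online one through MAIN_SYN, and the synonym is switched only once the load completes (SOURCE_TABLE is a placeholder for the real source):

-- Reports currently read STAGING_TEMP1 through MAIN_SYN
-- 1. Load the offline copy
truncate table STAGING_TEMP2;
insert into STAGING_TEMP2 select * from SOURCE_TABLE;
commit;

-- 2. Repoint the synonym; the switch is near-instant, so reports see no downtime
create or replace synonym MAIN_SYN for STAGING_TEMP2;

-- 3. On the next run, load STAGING_TEMP1 and toggle back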



   

Talend In One Day


  Basic Components
Explanation
tFileInputDelimited
Reads any delimited .txt or .csv file as input.
tFileOutputDelimited
Similar, but generates a delimited output file (.csv).
tFileInputXML
Reads any XML file as input and extracts fields via XPath.
Store the XML in any location.
Specify the XPath loop expression, e.g. <employee>.
Fields to extract, e.g. <empname>, <DOB>.
tRowGenerator
Generates rows sequentially.
Go to Talend Data Fabric and select the generating functions accordingly.
OrderID and Date sequences get auto-generated.
Similar to an Oracle sequence.
tFixedFlowInput
Manually creates a fixed dataset.
Basic Settings -> create 2 columns, State_Cd and State_Nm, and enter static values.
Change the number of rows and it generates that many static rows.
The inline table allows adding multiple rows, producing rows * number of columns values.
tMap
Performs a join between 2 input flows; the Expr. key performs the actual join.
row1 and row2 need to be joined on a common column in the Expr. key.
tSample
Generates sequential numbers as a field in the output. The range can be changed in the Basic settings.
tJava
Runs standalone core Java code entered in the Basic settings.
tSalesforceBulkExec
Give the Salesforce web-service SOAP URL with username and password.
Get the mySF0.1 connection from the Salesforce metadata.
tLibraryLoad
Loads any external library (JAR file) that can be called from tJava; the tJava component can then be invoked with OnComponentOk.


Big Data Components
Explanation
tHDFSConnection
Establishes a connection to an HDFS system.
Distribution: Apache
NameNode URI: hdfs://10.140.6.125:9000/
Username: <username of the virtual machine>
tHDFSConnection (via OnSubjobOk) can be connected to a tRowGenerator and tLogRow flow that ends with tHDFSOutput_1.
tHDFSOutput
Generates the output and stores it in the Hadoop system.
File name: /user/hadoop/talendoutput/out.txt
Action: Create
Run the job, then verify in HDFS:
hadoop fs -ls /user/hadoop/talendoutput
hadoop fs -cat /user/hadoop/talendoutput/out.txt
tHDFSInput
Retrieves data from an HDFS file.
Create the tHDFSConnection.
Add the tHDFSInput.
Enter the file name /user/hadoop/talendoutput/out.txt and the field separator.
Connect it to tLogRow.
 

Sunday, 20 November 2016

Data Model of an Employee Management System with Shift Schedules



Let us consider that we need to model an employee shift management system. This snapshot is just an outline and could vary based on the architecture principles; a sketch of the core tables follows below.
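A minimal sketch of the core tables; the names, columns, and constraints are hypothetical and the actual model in the snapshot may differ:

CREATE TABLE employee (
  emp_id        NUMBER        PRIMARY KEY,
  emp_name      VARCHAR2(100) NOT NULL,
  department_id NUMBER
);

CREATE TABLE shift (
  shift_id   NUMBER PRIMARY KEY,
  shift_name VARCHAR2(30),   -- e.g. MORNING, EVENING, NIGHT
  start_time VARCHAR2(5),    -- 'HH24:MI'
  end_time   VARCHAR2(5)
);

CREATE TABLE emp_shift_schedule (
  schedule_id NUMBER PRIMARY KEY,
  emp_id      NUMBER REFERENCES employee(emp_id),
  shift_id    NUMBER REFERENCES shift(shift_id),
  work_date   DATE,
  CONSTRAINT uq_emp_day UNIQUE (emp_id, work_date)  -- one shift per employee per day
);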

Hadoop in One day



Analyzing Big data with Hive

Hive
Structured data, queried with SQL (HiveQL)


To load data into a Hive table: LOAD DATA INPATH 'dbname/userinfo' INTO TABLE movies;
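A minimal sketch of a Hive table that such a load could target; the column list and file layout are hypothetical examples, not the actual schema:

-- Hypothetical MOVIES table matching a comma-delimited source file
CREATE TABLE movies (
  movie_id INT,
  title    STRING,
  rating   DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Move the HDFS files under 'dbname/userinfo' into the table's location
LOAD DATA INPATH 'dbname/userinfo' INTO TABLE movies;

-- Query like ordinary SQL
SELECT title, rating FROM movies WHERE rating > 4 LIMIT 10;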


Pig
Pig handles semi-structured and unstructured data; here the result sets are captured into two relations (sdppig, bvoippig).
grunt> sdppig = LOAD '/user/remsrch/sdp_daily.csv' USING PigStorage(',') AS (arrivaldate:chararray, status:int, hrsopen:int, closeddate:int, summary:chararray);
grunt> DUMP sdppig;
grunt> bvoippig = LOAD '/user/remsrch/bvoip_daily.csv' USING PigStorage(',') AS (arrivaldate:chararray, status:int, hrsopen:int, closeddate:int, summary:chararray);
grunt> DUMP bvoippig;
grunt> sdphigh = FOREACH sdppig GENERATE arrivaldate, status;
grunt> bvoiphigh = FOREACH bvoippig GENERATE arrivaldate, status;
grunt> combined = JOIN sdphigh BY arrivaldate, bvoiphigh BY arrivaldate;
grunt> DUMP combined;
grunt> STORE combined INTO 'user/remsrch' USING PigStorage(','); -- store the combined result back into HDFS
grunt> quit;
hdfs dfs -ls /user/remsrch

HBase - NoSQL Database
$ hbase shell
create 'tablename', 'columnfamily', ...          # creates a table with one or more column families
put 'tablename', 'rowkey', 'columnfamily:column', 'value'   # inserts data into a row and column
get 'tablename', 'rowkey'                        # queries a single row
scan 'tablename'                                 # displays all rows and columns
disable 'tablename'
drop 'tablename'                                 # drops the table (it must be disabled first)
Pig can help us move some data from HDFS into HBase.
app_stock is the table name used below.

scan 'app_stock', {LIMIT => 10}                  # displays 10 rows from the table

Ingesting means pushing files into HDFS. Automate all the manual file pushes into HDFS using bash scripting.








Performance Improvements
Tez -- a high-speed execution engine that provides parallelism
set hive.execution.engine=tez;
ORC -- a columnar format that stores the data in neighboring blocks
set hive.optimize.ppd=true;
set hive.enforce.sorting=true;
CREATE TABLE mytable (
...
) STORED AS orc;
Vectorization -- speeds up scans, aggregations, filters, and joins
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimizer -- considers the query cost and picks the plan with the least cost
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
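A minimal sketch of an ORC table that benefits from these settings; the table and column names are hypothetical:

-- Hypothetical ORC table, bucketed and sorted so predicate push-down and sorted joins apply
CREATE TABLE sales_orc (
  order_id    INT,
  customer_id INT,
  amount      DOUBLE
)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- Gather statistics so the cost-based optimizer has something to work with
ANALYZE TABLE sales_orc COMPUTE STATISTICS;
ANALYZE TABLE sales_orc COMPUTE STATISTICS FOR COLUMNS;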

HDFS
Multi Node cluster
Hadoop Master: 192.168.1.15 (hadoop-master)
Hadoop Slave: 192.168.1.16 (hadoop-slave-1)
Hadoop Slave: 192.168.1.17 (hadoop-slave-2)