Wednesday, May 2, 2018

DBSync Cloud Replication Tool for Salesforce






The DBSync Cloud Replication for Salesforce tool is used for synchronizing the data between Salesforce and Database. The Salesforce would be on cloud. The Database could be on cloud or on premises.




Some of the main application of this tool are
·         Building Data Warehouse for Reporting or for Data Mining
·         Data Migration
·         Offline Data Cleansing and may be subsequent synchronization (We can bring the data from Salesforce to offsite database…. Cleans it…. and/or perform manipulations and load it back to Salesforce)

DBSync Cloud Replication can be invoked from
·         Web interface
·         Command line
·         Shell
·         Windows Scheduler
·         Any ETL tool




DBSync Features:

The application creates the objects in the target DB that do not already exist.
The objects can be replicated in batch or in real-time
The objects in Salesforce can be updated from the Data Warehouse (bi-directional)


Monday, February 5, 2018

Hive


I am newbie in Hive and Big Data. Writing some basic hive commands for quick reference-



--================Show Databases===============
hive> show databases;
--================Create Database==============
hive> create database hive_db;
--================Drop Database================
hive> drop database hive_db;

--================Create Table=================
hive> create table emp_hive (empid int, empname string);

hive> show tables;
OK
emp_hive Time taken:" 0.115 seconds Fetched: 2 row(s)
--================Rename A Table===============

hive> ALTER table emp_hive RENAME to emp_hive_New;

hive> show tables:
OK
emp_hive_New
--================Drop table======================
hive> drop table emp_hive_New;

--================Table types and usage=================
Hive has two types of structures like Internal and External tables depending on loading and design of schema in Hive

Internal Tables: 
- First we create the table then load the data.
- This is "data on schema"
- By dropping this table, both data and schema will be removed
- the stored location of this table will be at /user/hive/warehouse

When to choose Internal Table ->
- If the the processing data is available in local file system
- If we want Hive to manage the complete lifecycle of data including the deletion

Create Internal Table:
hive> CREATE TABLE hive_internal_table (id INT, Name STRING);
Row format delimited
Fields terminated by '\t';

Load the Data into internal table 
hive> LOAD DATA INPATH '/user/arora_hive/data.txt ' INTO table hive_internal_table;

Display t he content of the table

hive> select * from hive_internal_table;

Drop the internal table
hive> DROP TABLE hive_internal_table;

when we drop the table, its data is also deleted from Hive


External Tables:
- Data will be available in HDFS. The table is going to create on HDFS data.
- Its schema on Data.
- When we drop the table, we drop the schema only. The data still remains in HDFS.

When to Choose External Table:
- If the processing data is available in HDFS.
- When the files are being used outside of HIVE.

Create External Table:
hive> CREATE EXTERNAL TABLE hive_external_table (id INT, Name string)
Row format delimited
Fields terminated by '\t'
LOCATION '/user/arora_hive/hive_external_table';

If we do not specify the location at the time of table creation, we can load the data manually-

hive> LOAD DATA INPATH '/user/arora_hive/data.txt' INTO TABLE hive_external_table;

Display the content of the table
hive> select * from hive_external_table;

To drop the external table-
hive> DROP TABLE hive_external_table;

--=========Difference in Internal and External Tables
Feature Internal External

Schema Data on Schema Schema on Data
Storage Location /usr/hive/warehouse HDFS LOCATION
Data Availability Within Local File System Within HDFS



What is the difference in Order by and Sort by? Distributed by and cluster by?