Saturday, December 12, 2015

Create Basic Date Dimension with Data



This statement creates a basic date dimension table with data starting from 1-Jan-2014 and spanning 2000 days.



CREATE TABLE DATE_DIM AS
  SELECT
       n AS Date_ID, -- (This is just a serial number, not a date)
       TO_NUMBER(TO_CHAR(TO_DATE('31/12/2013','DD/MM/YYYY') + NUMTODSINTERVAL(n,'day'),'YYYYMMDD')) AS Date_Key, -- (in format YYYYMMDD)
       TO_DATE('31/12/2013','DD/MM/YYYY') + NUMTODSINTERVAL(n,'day') AS Full_Date, -- (PK)
       TO_CHAR(TO_DATE('31/12/2013','DD/MM/YYYY') + NUMTODSINTERVAL(n,'day'),'DD') AS Days,
       TO_CHAR(TO_DATE('31/12/2013','DD/MM/YYYY') + NUMTODSINTERVAL(n,'day'),'Mon') AS Month_Short,
       TO_CHAR(TO_DATE('31/12/2013','DD/MM/YYYY') + NUMTODSINTERVAL(n,'day'),'MM') AS Month_Num,
       TO_CHAR(TO_DATE('31/12/2013','DD/MM/YYYY') + NUMTODSINTERVAL(n,'day'),'Month') AS Month_Long,
       TO_CHAR(TO_DATE('31/12/2013','DD/MM/YYYY') + NUMTODSINTERVAL(n,'day'),'YYYY') AS Year
  FROM (
       SELECT LEVEL n
       FROM dual
       CONNECT BY LEVEL <= 2000
  );
-- Date_Key is the date as a number in format YYYYMMDD, computed directly in the SELECT.
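For readers outside Oracle, the same rows can be generated in plain Python. This is a minimal sketch; the column names mirror the SQL above, and the month names produced by strftime depend on locale:

```python
from datetime import date, timedelta

def build_date_dim(start=date(2014, 1, 1), days=2000):
    """Generate date-dimension rows matching the SQL table above."""
    rows = []
    for n in range(1, days + 1):
        d = start + timedelta(days=n - 1)
        rows.append({
            "date_id": n,                           # serial number, not a date
            "date_key": int(d.strftime("%Y%m%d")),  # YYYYMMDD as a number
            "full_date": d,
            "days": d.strftime("%d"),
            "month_short": d.strftime("%b"),
            "month_num": d.strftime("%m"),
            "month_long": d.strftime("%B"),
            "year": d.strftime("%Y"),
        })
    return rows

dim = build_date_dim()
```

The first row carries date_key 20140101, and with 2000 rows the dimension runs through 23-Jun-2019.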

Sunday, November 15, 2015

Data Warehouse Hardware Sizing (Disk/RAM/CPU)



While deciding on the hardware sizing required, we are primarily looking at 

1) Disk Space Required for Machine hosting Data Warehouse
2) RAM on the Machine
3) CPU

This is for the ELT method, where the data warehouse and the integration server reside on the same machine (not ETL, where the DW and the integration server are on different machines).



Disk Space: We will need to consider the following ->

a) The size of the master tables and transaction tables we will pull from the source system into the data warehouse; these will form the dimensions and facts.

b) Space needed by the staging area. We may not pull all the columns of a particular table; if we pull only selected columns, less space is needed.

c) The tables in the source system are in normalized form, whereas in the DW they are denormalized; this redundancy means they need more space in the DW.

d) The dimensions we create in the DW may need to retain history (i.e. SCD implementation), which requires more space.

e) The indexes created on Dimension and Fact tables would need space.

f) There will be aggregated/summarized tables and/or MVs over and above the leaf-level fact tables, so space is needed for them too. The number of summary-level fact tables depends on the number of levels in the dimensions: the more levels, the more summary tables required. Summary tables also need indexes, which take additional space.
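The factors above can be combined into a back-of-envelope estimate. The multipliers in this sketch are purely illustrative assumptions to be tuned per project, not rules:

```python
def estimate_dw_disk_gb(source_gb,
                        selected_columns_ratio=0.6,  # (b) only selected columns staged
                        denorm_factor=1.5,           # (c) denormalization redundancy
                        scd_history_factor=1.3,      # (d) SCD history retention
                        index_factor=0.4,            # (e) index overhead
                        aggregate_factor=0.5):       # (f) summary tables/MVs + indexes
    """Rough disk estimate in GB; every factor is an assumption."""
    staging = source_gb * selected_columns_ratio
    base = source_gb * denorm_factor * scd_history_factor
    indexes = base * index_factor
    aggregates = base * aggregate_factor
    return round(staging + base + indexes + aggregates, 1)
```

For example, with these assumed factors, 100 GB of source data works out to roughly 430 GB of warehouse disk.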


Wednesday, August 12, 2015

Master Data and Reference Data




Master Data:

Master data is information that an enterprise as a whole agrees upon. An enterprise will most likely have several departments, such as HR, Marketing, Sales, Procurement, and Manufacturing.

Now all these departments may have different systems: HR may have some HRMS, while Marketing and Sales may have some marketing and/or sales management system.

You would have products in the marketing system and in the sales system too. That product data should be the same in both systems; these are the products that both Marketing and Sales agreed to use. The two systems should not end up with different sets of products.

Other examples could be customer data, employee data, and supplier data.

Reference Data:

Reference data is data that defines the permissible values other data may take.
Like currency codes, which are highly standardized and would rarely change.

Reference data is mostly used for data quality, data validation, or data standardization purposes. 

If I am maintaining currency codes, then all the systems (be it HR, Sales, etc.) should use the same currency codes.


Another example of reference data could be Country Codes or State Codes.
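The validation role of reference data can be sketched in a few lines. The code set below is a tiny illustrative sample, not a complete ISO 4217 list:

```python
# Reference data: permissible currency codes (small sample of ISO 4217)
VALID_CURRENCY_CODES = {"USD", "EUR", "GBP", "INR", "JPY"}

def validate_currency(record):
    """Reject records whose currency code is not in the reference set."""
    if record["currency"] not in VALID_CURRENCY_CODES:
        raise ValueError(f"Unknown currency code: {record['currency']}")
    return record

validate_currency({"amount": 250, "currency": "EUR"})  # passes validation
```

Any system that writes monetary records would run them through a check like this, so HR, Sales, and the rest stay consistent.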

Saturday, August 1, 2015

Tableau Vs OBIEE





The major difference between OBIEE and Tableau is that Tableau doesn't really have a semantic layer (the BMM layer in OBIEE). With Tableau, we have to have the data well manicured (given cosmetic treatment) in order to work effectively. There are differences, but the two can and should co-exist in any organization.

Tableau Merits Over OBIEE

·         Quick and rapid development of reports and dashboards.  Easy enough for business people to figure out with little to no training.  Companies may use Tableau for prototyping and then building the enterprise wide reports in something like OBIEE.
·         Great desktop reporting tool.
·         Has an excellent recommendation tool for which type of visual analysis to use based on selected data.
·         Almost true WYSIWYG.
·         Positioning of dashboard objects much more flexible.
·         Good free on demand training and extremely active user community.
·         Maps are very easy to use with lots of cool functionality.


OBIEE Merits Over Tableau

·         More advanced dashboarding capabilities such as conditional guided navigation.  Also much better at managing tons of dashboards. Use of sub reports.
·         Stronger enterprise level security and user management.
·         Shared reporting objects, like filters.
·         Better at federating data (connecting to multiple sources) than Tableau.  Tableau can do it and is OK at it, but OBIEE is the best of all reporting tools at it.
·         The best functioning KPI/scorecard reports,  an area where OBIEE rocks the competition.
·         Great integration to other Oracle products. IE, can interface directly with Fusion/EBS/Forms.
·         Has multi-user development support and better environment management/migration. Tableau doesn't offer multiple environments or multiple users working on the same reports.
·         Ad-Hoc reports can be created through the web interface by any user you grant access to.  That user, however, is limited to what you have defined in your semantic layer.  Tableau requires their client to do ad-hoc.  However, the user can then do any kind of logical calculations or other data manipulation they wish.
·         Can cache queries. Works very well and writes very efficient SQL when directly querying databases. Tableau has a proprietary data format that basically works as a cube, which can slow down very quickly on large data sets. It is also pretty lackluster at performing direct queries against the database.


While Tableau continues to add features very quickly, it isn't yet a true enterprise-wide reporting platform. It is still best suited for departmental use in the hands of a data analyst.

OBIEE requires a lot more effort to create even basic visualizations and usually requires a full development team.


There are pros and cons to both.  

Difference between Data Blending and Data Join




Joining is where you are able to combine data from the SAME data source, for example worksheets in an Excel file or tables in an Oracle database.  You need a common field, also called a key, that tells Tableau how to join the data.  You also need a key to do blending.


Blending is where you are able to combine data from DIFFERENT data sources, for example a worksheet in Excel with a table from an Oracle database. This is very powerful because in the past you would typically need someone with specific technical skills to create a new data set with the combined information.
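Outside Tableau, the key-based combination behind both joining and blending can be sketched in plain Python. The two row lists below stand in for an Excel sheet and a database table, and the field names are illustrative:

```python
# Rows from two different "sources", sharing the common key field "region"
excel_rows = [{"region": "East", "target": 100},
              {"region": "West", "target": 80}]
db_rows = [{"region": "East", "sales": 120},
           {"region": "West", "sales": 70}]

def combine_on_key(left, right, key):
    """Combine rows from two sources on a common key field."""
    lookup = {row[key]: row for row in right}
    return [{**l, **lookup.get(l[key], {})} for l in left]

combined = combine_on_key(excel_rows, db_rows, "region")
```

After combining, each region carries both its target (from the spreadsheet) and its sales (from the database).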

Wednesday, July 1, 2015

Tableau: Connections you can make with your dataset




We can either connect live to our data set or extract data onto Tableau.

Live: Connecting live to a data set leverages the database's computational processing and storage. New queries go to the database, and new or updated data is reflected in the workbook.

Extract: An extract takes a static snapshot of the data to be used by Tableau's data engine. The snapshot can be refreshed on a recurring schedule, either as a whole or by incrementally appending data. One way to set up these schedules is via Tableau Server.



The benefit of a Tableau extract over a live connection is that an extract can be used anywhere, without any connection; you can build your visualization without connecting to the database.


When to use Live Connection:


When you need up-to-date data, i.e. when the data in the database is updated frequently and you need those updates reflected in the workbook/dashboard.


When the database is fast.




When to use Extract:


When the database is not fast enough for interactive analysis (you change a filter value and the workbook takes a long time to refresh).


When you want to take the load off the database.


When you need to be offline (e.g., on an airplane) and want to work with the workbook.


Friday, January 9, 2015

Early Arriving Fact (or Late Arriving Dimension)



What are early arriving facts? (Or late arriving dimensions?)

An early arriving fact occurs when a fact record arrives at the data warehouse before its dimension record. 


When we look up the surrogate key in our dimension table using the fact table's natural key, the dimension member doesn't exist (yet).


In many cases when we load our data warehouse, we would assume that dimensions and facts would arrive at the same time, or dimension would arrive before fact record that has reference to the dimension.


There are many reasons why we have to be able to handle early arriving facts; two dominating factors are:

a) multiple source systems, and/or
b) different load frequencies.

We might be loading our dimensions using a full load strategy with one load schedule, while using an incremental load strategy for our fact table with a close to real-time frequency.


An employee availing of medical insurance through his employer is eligible for insurance coverage from the first day of employment. But the employer may not provide the medical insurance information to the insurance provider for several weeks. If the employee undergoes any medical treatment during this time, his medical claim records will arrive as fact records without the corresponding patient dimension details.

Design Approaches

Depending on the business scenario and the type of dimension in use, we can take different design approaches.
1. Hold the Fact record until Dimension record is available.
2. 'Unknown' or default Dimension record.


1. Hold the Fact record until Dimension record is available
One approach is to place the fact row in a suspense table. The fact row will be held in the suspense table until the associated dimension record has been processed. This solution is relatively easy to implement, but the primary drawback is that the fact row isn’t available for reporting until the associated dimension record has been handled.

This approach is more suitable when your data warehouse is refreshed as a scheduled batch process and a delay in loading fact records until the dimension records are available is acceptable for the business.

2. 'Unknown' or default Dimension record
Another approach is to simply assign the “Unknown” dimension member to the fact record. On the positive side, this approach does allow the fact record to be recorded during the ETL process. But it won’t be associated with the correct dimension value. 

The "Unknown" fact records can also be kept in a suspense table. Eventually, when the dimension data is processed, the suspense data can be reprocessed and associated with a real, valid dimension record.
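The 'Unknown'-member approach with a suspense table can be sketched as follows. The -1 surrogate key and the patient/claim field names are illustrative conventions from the insurance example, not a fixed standard:

```python
UNKNOWN_SK = -1   # conventional surrogate key for the 'Unknown' dimension member

dim_patient = {}  # natural key -> surrogate key (the dimension lookup)
suspense = []     # early-arriving facts awaiting their dimension row

def load_fact(record):
    """Assign the 'Unknown' member when the dimension row hasn't arrived yet."""
    sk = dim_patient.get(record["patient_nk"], UNKNOWN_SK)
    if sk == UNKNOWN_SK:
        suspense.append(record)  # keep the source record for later reprocessing
    return {**record, "patient_sk": sk}

def reprocess_suspense():
    """After the dimension loads, re-resolve any suspended facts."""
    resolved = [load_fact(r) for r in list(suspense)
                if dim_patient.get(r["patient_nk"]) is not None]
    suspense[:] = [r for r in suspense
                   if dim_patient.get(r["patient_nk"]) is None]
    return resolved
```

A claim loaded before its patient record gets surrogate key -1 and lands in the suspense list; once the patient dimension row arrives, reprocessing assigns the real surrogate key.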