Wednesday, February 3, 2016

Features to consider while selecting Reporting/Analytic tool


Features to consider while selecting a Reporting/Analytic tool:


  1. Does the BI tool have a semantic layer, like the RPD in OBIEE or the Universe in Business Objects? The idea is to create data models in the semantic layer and use them to create multiple reports.
  2. Does the BI tool provide for creating hierarchies? If yes, can we create ragged and skip-level hierarchies?
  3. Self-Service BI
  4. Advanced Analytics (KPI/Scorecard)
  5. Ease of Deployment
  6. Ad-hoc reporting
  7. Slice and Dice
  8. Drill Up/Down/Through
  9. Constrained Prompts (or Cascading Filters)
  10. Formatted Reports (Pixel Perfect Report)
  11. Report Bursting
  12. Geospatial Analysis
  13. Ease of Filtering data on dashboard
  14. Development Life-cycle
  15. Ability to export reports to Excel/CSV/PDF, etc.
  16. Can it be deployed on the cloud?
  17. Conditional Delivery of Reports (i.e., deliver the report to a user's mailbox only when sales drop below a preset level).
  18. Conditional Formatting supported (e.g., highlight cells in red when sales drop below a level and in green when above it).
  19. Writeback Support
  20. Event driven cache/extract refresh
  21. Do we have provision for Guided Navigation? (i.e., one dashboard has multiple reports, but which report is visible depends on a certain condition, and that condition is derived from another report). OBIEE has this Guided Navigation feature.
  22. Can the BI tool's semantic layer connect to Cubes?
  23. Does the BI tool support Impersonation?


Tuesday, February 2, 2016

ETL Best Practices

Some of the best practices and considerations in ETL:


1) Running Scripts in Parallel:
As much as possible, the code should run in parallel. For example, all the staging tables can be loaded from the source database in parallel. Subsequently, all the scripts that load data from staging into dimension tables can be run in parallel, provided they have no dependencies on each other. Where dependencies exist, those loads can run sequentially. A small sketch of this idea follows below.
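A minimal sketch of the parallel-loading idea in Python using concurrent.futures; the load_staging_table function and the table names are hypothetical placeholders for the real per-table load scripts.

# Sketch: load independent staging tables in parallel.
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_staging_table(table_name):
    # Placeholder for the real extract-and-load logic for one table.
    print(f"Loading staging table {table_name}")
    return table_name

STAGING_TABLES = ["stg_customers", "stg_products", "stg_orders"]  # hypothetical names

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(load_staging_table, t): t for t in STAGING_TABLES}
    for future in as_completed(futures):
        print(f"Finished {future.result()}")

# Dimension loads that depend on each other would then run sequentially after this block.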

2) Periodic Maintenance: VACUUM and ANALYZE can be performed periodically on databases like Amazon Redshift and Greenplum. For a database like Oracle, we should refresh the statistics of tables and indexes periodically, usually during off hours. A small sketch follows below.
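A minimal sketch, assuming a Redshift/Greenplum-style database reachable through psycopg2; the connection details and table names are hypothetical.

# Sketch: run VACUUM/ANALYZE on selected tables during off hours.
import psycopg2

TABLES = ["dw.fact_sales", "dw.dim_customer"]  # hypothetical table names

conn = psycopg2.connect(host="redshift-host", dbname="dw", user="etl", password="...")
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()
for table in TABLES:
    cur.execute(f"VACUUM {table}")
    cur.execute(f"ANALYZE {table}")
cur.close()
conn.close()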

3) Handling Bad Data: If there is bad data in the source, instead of rejecting the entire load we can process the good data, load the bad data into an error-log table, and generate an email alert informing a set of users (business and/or technical). A sketch of this pattern is shown below.
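A minimal sketch of the split-and-alert idea in Python; the validation rule, the error-log loader, and the mail host/addresses are all hypothetical.

# Sketch: keep good rows, divert bad rows to an error log, and alert users.
import smtplib
from email.message import EmailMessage

def is_valid(row):
    # Hypothetical rule: a row must have a non-empty customer_id and a numeric amount.
    return bool(row.get("customer_id")) and isinstance(row.get("amount"), (int, float))

def load_rows(rows, load_good, load_error):
    good = [r for r in rows if is_valid(r)]
    bad = [r for r in rows if not is_valid(r)]
    load_good(good)          # insert into the target table
    if bad:
        load_error(bad)      # insert into an error-log table with a reason and timestamp
        msg = EmailMessage()
        msg["Subject"] = f"ETL load: {len(bad)} rejected rows"
        msg["From"] = "etl@example.com"
        msg["To"] = "dw-team@example.com"
        msg.set_content("Bad rows were diverted to the error-log table; see details there.")
        with smtplib.SMTP("mail.example.com") as smtp:  # hypothetical mail host
            smtp.send_message(msg)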

4) Parameters: Use a metadata or control table, and keep the code parameterized. For example, the connection details of the source and target databases (host name, user ID, password, etc.) should sit in a parameter file or some other central place. The idea is not to have these details in the code, so that we can point to a different source or target database without touching every script. A sketch follows below.
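One way to sketch this in Python with configparser; the file name (etl.ini), sections, and keys are hypothetical.

# Sketch: read source/target connection details from a central parameter file
# instead of hard-coding them in each script.
#
# etl.ini might look like:
# [source]
# host = src-db.example.com
# user = etl_reader
# password = ...
#
# [target]
# host = dw.example.com
# user = etl_writer
# password = ...

import configparser

config = configparser.ConfigParser()
config.read("etl.ini")

source_host = config["source"]["host"]
target_host = config["target"]["host"]
# Pointing to a different database now means editing etl.ini, not every script.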

5) Incremental Load: As much as possible, we should load data from source to target (staging) incrementally. This has a substantial effect on overall ETL time. (If needed, we can consider implementing CDC, Change Data Capture.) A watermark-based sketch is shown below.
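A minimal watermark-based sketch, assuming the source rows carry an updated_at column and the last successful load time is stored in a control table; the table and column names are hypothetical, and sqlite3 stands in for the real connections.

# Sketch: pull only rows changed since the last successful load (watermark).
import sqlite3  # stand-in for the real source/control database connections
from datetime import datetime, timezone

def incremental_extract(control_conn, source_conn, job_name="load_stg_orders"):
    # 1) Read the watermark recorded by the previous run.
    (last_load,) = control_conn.execute(
        "SELECT last_load_ts FROM etl_control WHERE job_name = ?", (job_name,)
    ).fetchone()

    # 2) Extract only the rows changed since then.
    rows = source_conn.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (last_load,)
    ).fetchall()

    # 3) After a successful load, advance the watermark.
    control_conn.execute(
        "UPDATE etl_control SET last_load_ts = ? WHERE job_name = ?",
        (datetime.now(timezone.utc).isoformat(), job_name),
    )
    control_conn.commit()
    return rows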


6) Consistent Method of Error Handling and Notification

7) Which Data Warehouse Approach: 

There are mainly two DW approaches - Top Down and Bottom Up. The selected approach deeply affects the DW life cycle.

Top Down: Build Enterprise Data Warehouse First and feed the Data Marts later.

Bottom Up: Build the Data Marts first and then build the Enterprise Data Warehouse from the data marts as and when needed.


8) ETL Tool Selection: 
A very crucial step, especially for a new Data Warehouse initiative. A comparison of ETL tools can be performed by assigning a weight to each feature the tools support. One tool may perform great on a specific feature, but if we do not need that feature in our DW, our selection should not depend heavily on it and we can assign it a lower weight. This way we can build a scoring matrix and make a decision based on the overall score, as sketched below.
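A small sketch of the weighted scoring matrix in Python; the features, weights, and scores are made-up illustrations, not recommendations.

# Sketch: weighted scoring matrix for ETL tool selection (numbers are illustrative).
weights = {"CDC support": 3, "SCD support": 5, "Cloud deployment": 2, "Ease of use": 4}

# Raw feature scores per tool on a 1-5 scale (hypothetical).
scores = {
    "Tool A": {"CDC support": 4, "SCD support": 5, "Cloud deployment": 2, "Ease of use": 3},
    "Tool B": {"CDC support": 5, "SCD support": 3, "Cloud deployment": 5, "Ease of use": 4},
}

for tool, feature_scores in scores.items():
    total = sum(weights[f] * s for f, s in feature_scores.items())
    print(tool, total)
# The tool with the highest weighted total wins, unless a must-have feature is missing.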


9) Master Data Management:
The aim of data integration is to provide a 360-degree view of the enterprise data. Data integration becomes necessary when the end reports from the BI system include data elements from various source systems.

For the context of our ETL environment, data integration takes the form of conforming dimensions and conforming facts in the data warehouse. Conforming dimensions means establishing common dimensional attributes across separate fact tables so that "drill across" reports can be generated using these attributes. Conforming facts means agreeing on common business metrics such as key performance indicators (KPIs) across separate databases so that these numbers can be compared mathematically for calculating differences and ratios. A small drill-across illustration follows below.
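A tiny illustration of drilling across on a conformed dimension attribute, using pandas; the data and column names are made up.

# Sketch: two fact tables share a conformed "product_category" attribute,
# so a drill-across report can compare their metrics side by side.
import pandas as pd

sales = pd.DataFrame({"product_category": ["A", "B"], "sales_amt": [100, 250]})
inventory = pd.DataFrame({"product_category": ["A", "B"], "on_hand_qty": [40, 15]})

drill_across = (
    sales.groupby("product_category", as_index=False)["sales_amt"].sum()
    .merge(
        inventory.groupby("product_category", as_index=False)["on_hand_qty"].sum(),
        on="product_category",
    )
)
print(drill_across)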


Reference: 
http://www.computerweekly.com/tip/Six-ETL-best-practices-followed-by-Shoppers-Stop

https://www.timmitchell.net/etl-best-practices/

http://www.dataintegration.info/etl




Features to Consider While Selecting ETL Tool



Features to consider while selecting an ETL tool:


  1. Easy to Use
  2. Large Volume Performance
  3. Built-in Scheduler
  4. Non-RDBMS Connections
  5. Join Multiple Sources  (Heterogeneous)
  6. Deploy in Cloud Option
  7. Complex Transformation
  8. System Platform Supported
  9. SCD Support
  10. Debugging Facility
  11. CDC Support
  12. Command Line Operations
  13. Job Controlling and Scheduling
  14. Commercial  or Open Source
  15. Built-in Version Control