Features to consider while selecting a Reporting/Analytics tool:
- Does the BI tool have a semantic layer, like the RPD in OBIEE or the Universe in Business Objects? The idea is to create data models in the semantic layer and reuse them across multiple reports.
- Does the BI tool support creating hierarchies? If yes, can we create ragged and skip-level hierarchies?
- Self-Service BI
- Advanced Analytics (KPI/Scorecard)
- Ease of Deployment
- Ad-hoc reporting
- Slice and Dice
- Drill Up/Down and Drill Through
- Constrained Prompt (or Cascading Filter)
- Formatted Reports (Pixel Perfect Report)
- Report Bursting
- Geospatial Analysis
- Ease of Filtering data on dashboard
- Development Life-cycle
- Ability to export reports to Excel/CSV/PDF, etc.
- Can it be deployed on the cloud?
- Conditional Delivery of Reports (i.e. deliver the report to a user's mailbox only when sales drop below a pre-set level).
- Conditional Formatting support (e.g. highlight cells in red when sales drop below a threshold and in green when they are above it).
- Writeback Support
- Event-driven cache/extract refresh
- Is Guided Navigation available? (i.e. a dashboard has multiple reports, but which report is visible depends on a certain condition, and that condition is derived from another report). OBIEE has this Guided Navigation feature.
- Can the BI tool's semantic layer connect to Cubes?
- Does the BI tool support impersonation?
Some Best Practices and Considerations in ETL
1) Running Scripts in Parallel:
As much as possible, the code should run in parallel. For example, all the staging tables can be loaded from the source database in parallel. Subsequently, all the scripts that load data from staging into the dimension tables can be run in parallel, provided they have no dependencies on each other. Where dependencies exist, those loads can be performed sequentially (see the sketch below).
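A minimal sketch of parallel staging loads using Python's concurrent.futures; the table list and the body of load_staging_table are hypothetical placeholders for the real extract/load logic.

from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical list of independent staging loads.
STAGING_TABLES = ["stg_customers", "stg_orders", "stg_products"]

def load_staging_table(table_name):
    """Placeholder: extract from the source database and load the staging table."""
    print(f"Loading {table_name} ...")
    # e.g. run a COPY or INSERT ... SELECT against the staging database here
    return table_name

# Independent staging loads run concurrently; dependent dimension loads
# would be sequenced only after all of their prerequisites complete.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(load_staging_table, t) for t in STAGING_TABLES]
    for f in as_completed(futures):
        print(f"Finished {f.result()}")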
2) Periodic Maintenance: VACUUM and ANALYZE can be performed periodically on databases like Amazon Redshift and Greenplum. For a database like Oracle, we should refresh the statistics of tables and indexes periodically, usually during off hours. A sketch follows.
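A minimal sketch of running such maintenance on a Redshift/Greenplum-style database, assuming psycopg2 is available; the connection details and table names are hypothetical.

import psycopg2

# Connection details would normally come from a parameter file (see item 4).
conn = psycopg2.connect(host="dw-host", dbname="dw", user="etl_user", password="secret")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

for table in ["dim_customer", "fact_sales"]:
    with conn.cursor() as cur:
        cur.execute(f"VACUUM {table};")   # reclaim space and re-sort rows
        cur.execute(f"ANALYZE {table};")  # refresh the planner statistics
conn.close()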
3) Handling Bad Data: If there is bad data in the source, instead of rejecting the entire load we can process the good records, load the bad records into an error log table, and generate an email alert informing a set of users (business and/or technical). A sketch follows.
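A minimal sketch of this reject-and-notify pattern, assuming rows arrive as dictionaries; the validation rule, error-table handling, SMTP host and addresses are hypothetical.

import smtplib
from email.message import EmailMessage

def split_good_and_bad(rows):
    """Hypothetical rule: rows without a customer_id are treated as bad data."""
    good, bad = [], []
    for row in rows:
        (good if row.get("customer_id") else bad).append(row)
    return good, bad

def notify(bad_count):
    """Alert business/technical users that rejected rows were logged."""
    msg = EmailMessage()
    msg["Subject"] = f"ETL load: {bad_count} rejected rows written to error_log"
    msg["From"] = "etl@example.com"
    msg["To"] = "dw-team@example.com"
    msg.set_content("See the error_log table for rejected rows and reject reasons.")
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

rows = [{"customer_id": 1, "amount": 10.0}, {"customer_id": None, "amount": 5.0}]
good, bad = split_good_and_bad(rows)
# load `good` into the target table, insert `bad` into the error_log table, then:
if bad:
    notify(len(bad))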
4) Parameters: Use a metadata or control table. The code should be parameterized. For example, the connection details of the source and target databases (host name, user ID/password, etc.) should live in a parameter file or in one central place. The idea is not to keep these details in the code, so we avoid touching every script when we need to point to a different source or target database. A sketch follows.
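A minimal sketch using Python's configparser; the etl.ini file name, sections and keys are hypothetical.

from configparser import ConfigParser

# Hypothetical etl.ini kept outside the code base:
#
# [source]
# host = src-db.example.com
# user = etl_reader
#
# [target]
# host = dw.example.com
# user = etl_writer

config = ConfigParser()
config.read("etl.ini")

source_host = config["source"]["host"]
target_host = config["target"]["host"]
# Every script reads these values, so repointing to a different database
# only requires editing etl.ini, not the scripts themselves.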
5) Incremental Load: As much as possible we should load the data from source to target (staging) incrementally. This has a substantial effect on the overall ETL time. (If needed, we can consider implementing CDC, Change Data Capture.) A sketch follows.
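A minimal sketch of a watermark-driven incremental extract, using sqlite3 only as a stand-in DB-API driver; the etl_control table, updated_at column and table names are hypothetical.

import sqlite3  # stand-in for the actual source/control database drivers

def get_last_watermark(ctrl_conn, table_name):
    """Read the high-water mark (last successfully loaded timestamp) from the control table."""
    row = ctrl_conn.execute(
        "SELECT last_loaded_at FROM etl_control WHERE table_name = ?", (table_name,)
    ).fetchone()
    return row[0] if row else "1900-01-01 00:00:00"

def extract_incremental(src_conn, table_name, watermark):
    """Pull only the rows changed since the last load instead of a full extract."""
    return src_conn.execute(
        f"SELECT * FROM {table_name} WHERE updated_at > ?", (watermark,)
    ).fetchall()

def update_watermark(ctrl_conn, table_name, new_watermark):
    """After a successful load, advance the watermark in the control table."""
    ctrl_conn.execute(
        "UPDATE etl_control SET last_loaded_at = ? WHERE table_name = ?",
        (new_watermark, table_name),
    )
    ctrl_conn.commit()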
6) Consistent Method of Error Handling and Notification
7) Which Data Warehouse Approach:
There are mainly two DW approaches - Top Down and Bottom Up. The selected approach will deeply affect the DW life cycle.
Top Down: Build the Enterprise Data Warehouse first and feed the Data Marts from it later.
Bottom Up: Build the Data Marts first and then build the Enterprise Data Warehouse from the DMs as and when needed.
8) ETL Tool Selection:
A very crucial step, especially for a new Data Warehouse initiative. A comparison of ETL tools can be performed by assigning a weightage to each of the various features that the ETL tools support. One tool may perform great on a specific feature, but if we don't need that feature in our DW, the selection should not depend heavily on it and we can assign it a lower weightage. This way we can create a matrix and make the decision based on the overall score (see the sketch below).
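A minimal sketch of such a weighted scoring matrix; the feature names, weights and ratings are hypothetical examples, not recommendations.

# Weights reflect how important each feature is to our DW (0-5);
# per-tool scores are 1-5 ratings from the evaluation.
weights = {"large_volume_performance": 5, "scd_support": 4, "cdc_support": 3, "cloud_deploy": 2}

tool_scores = {
    "Tool A": {"large_volume_performance": 4, "scd_support": 3, "cdc_support": 5, "cloud_deploy": 2},
    "Tool B": {"large_volume_performance": 5, "scd_support": 4, "cdc_support": 2, "cloud_deploy": 4},
}

def weighted_total(scores):
    return sum(weights[feature] * scores[feature] for feature in weights)

for tool, scores in sorted(tool_scores.items(), key=lambda kv: -weighted_total(kv[1])):
    print(tool, weighted_total(scores))
# The tool with the highest overall score wins the comparison.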
9) Master Data Management:
The aim of data integration is to provide a 360-degree view of the enterprise data. Data integration becomes necessary when the end reports from the BI system include data elements from various source systems.
In the context of our ETL environment, data integration takes the form of conformed dimensions and conformed facts in the data warehouse. Conforming dimensions means establishing common dimensional attributes across separate fact tables so that "drill across" reports can be generated using these attributes. Conforming facts means making agreements on common business metrics, such as key performance indicators (KPIs), across separate databases so that these numbers can be compared mathematically for calculating differences and ratios. A drill-across sketch follows.
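A minimal sketch of a drill-across query enabled by a conformed dimension; the fact/dimension table names and columns are hypothetical.

# Two fact tables (fact_sales, fact_returns) share a conformed dim_product
# dimension; each is aggregated separately and the results are joined on the
# conformed attribute.
DRILL_ACROSS_SQL = """
SELECT s.product_name,
       s.total_sales,
       r.total_returns,
       s.total_sales - r.total_returns AS net_sales
FROM (SELECT d.product_name, SUM(f.sales_amount) AS total_sales
      FROM fact_sales f JOIN dim_product d ON f.product_key = d.product_key
      GROUP BY d.product_name) s
JOIN (SELECT d.product_name, SUM(f.return_amount) AS total_returns
      FROM fact_returns f JOIN dim_product d ON f.product_key = d.product_key
      GROUP BY d.product_name) r
  ON s.product_name = r.product_name
"""

def drill_across(conn):
    """Run the drill-across report through a DB-API connection's cursor."""
    cur = conn.cursor()
    cur.execute(DRILL_ACROSS_SQL)
    return cur.fetchall()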
References:
http://www.computerweekly.com/tip/Six-ETL-best-practices-followed-by-Shoppers-Stop
https://www.timmitchell.net/etl-best-practices/
http://www.dataintegration.info/etl
Features to consider while selecting an ETL tool:
- Easy to Use
- Large Volume Performance
- Built-in Scheduler
- Non-RDBMS Connections
- Join Multiple Sources (Heterogeneous)
- Deploy in Cloud Option
- Complex Transformation
- System Platform Supported
- SCD Support
- Debugging Facility
- CDC Support
- Command Line Operations
- Job Controlling and Scheduling
- Commercial or Open Source
- Built-in Version Control