Sizing considerations for BI Analytic Applications Implementation
Every time I hear people talking about sizing for their BI implementations, I am tempted to ask what exactly they mean by 'sizing', and what kind of recommendations and suggestions they would like to get back. Why can't someone (BI solution providers, for example) come up with a magic formula or an Excel spreadsheet that produces the most accurate estimates? Well, I can assure you that various consulting businesses do actively use such spreadsheets and templates, and they make a big deal of providing sizing recommendations. And I agree that sizing is a serious consulting project, though I have my own opinion about templates and spreadsheets.
So, let's look into BI Sizing in more detail. Sizing, more correctly referred to as Capacity Planning, is the first phase of a BI Implementation, and its primary objective is to identify the following:
- hardware specs for the future Data Warehouse tier, using generic metrics such as memory, CPU cores and storage
- ETL / ELT tier hardware specs, again RAM, CPU cores, storage
- BI Reporting tier hardware specs, RAM, CPU cores and storage
Let's list the major factors one needs to account for when working on such capacity planning:
- Initial ETL volumes, or the amount of historic data planned to be loaded into the future warehouse - these define the footprint of the initial deployment, and the initial volumes will also affect incremental ETL and reporting, of course.
- Incremental ETL volumes, which will define the daily ETL processing windows
- # of Functional Areas planned to be implemented. The more areas you implement, the more complex a data model you end up with, and the more complex and diverse the reports you produce.
- # of Source Databases used in ETL extracts for loading into the warehouse - yes, processing data from a single source is quite different from processing multiple sources.
- Source Database hardware specs, source DB configuration, and its workload during ETL executions - these items have a direct impact on your ETL extract performance
- Data Warehouse (Target) hardware specs and Warehouse DB configuration
- ETL tier hardware specs
- LAN / WAN factor - it's critical to host ETL and Target tiers in the same LAN as your source DBs.
- BI Reporting tier hardware specs
- Any resource sharing on Source / ETL / Target hardware (sharing with other DBs, virtualization, shared storage, etc).
- Number of named (total) users working with BI Reports
- Number of concurrent end users, generating workload during reporting windows
- Number of power users, crunching the numbers from their ad-hoc reports (OBIEE Answers users, for example)
- Caching effectiveness in ETL and BI Reports
- Human factor - BI Admins, DBAs, performance gurus, their skills and experience, and any omissions and oversights - all of these play a critical role in BI performance.
So, that's sizing / capacity planning. If you intend to provide a magic formula, make sure to factor in all these items for your magic to work.
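To make the point concrete, here is a deliberately naive sketch (in Python) of what one slice of such a "magic formula" might look like - estimating the target-tier storage footprint from initial and incremental ETL volumes. Every coefficient below (index overhead, TEMP fraction, growth headroom) is an illustrative assumption, not vendor guidance, which is exactly why I am skeptical of one-size-fits-all spreadsheets:

```python
def estimate_target_storage_gb(
    initial_load_gb,        # historic data loaded at initial deployment
    daily_increment_gb,     # incremental ETL volume per day
    retention_years,        # how long incremental history is kept
    index_overhead=0.5,     # indexes/aggregates as a fraction of base data (assumed)
    temp_fraction=0.3,      # TEMP space for ETL sorts and joins (assumed)
    growth_headroom=0.25,   # free-space headroom for unplanned growth (assumed)
):
    """Naive storage estimate for the warehouse (target) tier.

    All coefficients are illustrative assumptions; a real sizing
    engagement would have to validate each one against the actual
    data model, compression settings, and ETL design.
    """
    base_gb = initial_load_gb + daily_increment_gb * 365 * retention_years
    with_indexes = base_gb * (1 + index_overhead)
    with_temp = with_indexes * (1 + temp_fraction)
    return with_temp * (1 + growth_headroom)

# 2 TB of history, 5 GB/day incremental, 3-year retention:
print(round(estimate_target_storage_gb(2048, 5, 3)))  # prints 18337 (GB)
```

Note how sensitive the result is to the assumed coefficients - halving the index overhead alone shaves off several terabytes, which is why a POC beats a spreadsheet.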
I advocate a more pragmatic approach to capacity planning, based on common sense and experience from past implementations:
- Make sure you procure easily scalable hardware for your Target tier. For example, you can start with a 1/4 rack of Exadata and add more as needed.
- Consider cluster deployments for your BI Reporting tools. For example, you can build an OBIEE Cluster and add more boxes later on. The same applies to ETL tools - for example, you have the option to configure Informatica load balancing via Oracle DAC or license the Informatica PowerCenter partitioning option.
- If you can afford it, always budget time for executing a test POC that loads 1-2 years of historic data into your test warehouse. Then you will be able to extrapolate the amount of space needed for Production.
- Make sure you allocate sufficient TEMP space for your ETL processing.
- Remember that ETL is usually designed to max out the resources on the target so that ETL jobs finish as fast as possible. And yes, it's a bad idea to have end users in your warehouse during ETL. So, you can colocate ETL and BI Reporting on the same machine, as they will utilize the hardware without any overlap (unless, of course, you intend to provide a high availability option).
- Do the basic math to estimate the amount of memory on the target tier needed to accommodate your estimated end user concurrency. A typical BI query may process far more data than an OLTP transaction, so a single Oracle database session can easily consume 100-200 MB of PGA while processing heavy time series reports.
- I/O configuration is another important factor (ironically, it's frequently forgotten in capacity planning discussions). If you allocate network-attached storage from a busy filer, you will need to adjust your performance expectations accordingly.
- Virtual machines vs. real hardware: virtual configurations by definition assume hardware resource sharing. Factor that in when you consider using VMs in your BI implementation.
- Make sure you have sound performance monitoring for your ETL and BI reporting. With Troype APE Diagnostics you get the most powerful tools for monitoring ETL performance and OBIEE end user reports and dashboards, analyzing and mapping database workload events, and accumulating historic runtime performance statistics to identify critical trends and respond to them in a timely fashion. We carefully track each environment's topology (ref. pic1), configuration (pic2), and workload (pic3).
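The "basic math" on target-tier memory mentioned in the list above can be sketched like this in Python. The 100-200 MB per-session PGA range comes from the text; the workload mix, the light-query footprint, the SGA size, and the OS headroom are all assumed placeholders you would replace with figures from your own environment:

```python
def estimate_target_memory_gb(
    concurrent_users,
    heavy_query_share=0.2,   # fraction of sessions running heavy time series reports (assumed)
    heavy_pga_mb=200,        # upper end of the 100-200 MB PGA range for heavy BI queries
    light_pga_mb=20,         # assumed footprint of a typical dashboard query
    sga_gb=64,               # buffer cache / shared pool, sized separately (assumed)
    os_headroom=0.2,         # OS and filesystem cache headroom (assumed)
):
    """Back-of-the-envelope memory estimate for the warehouse target tier."""
    heavy_mb = concurrent_users * heavy_query_share * heavy_pga_mb
    light_mb = concurrent_users * (1 - heavy_query_share) * light_pga_mb
    pga_gb = (heavy_mb + light_mb) / 1024
    return (pga_gb + sga_gb) * (1 + os_headroom)

# 300 concurrent end users during the reporting window:
print(round(estimate_target_memory_gb(300), 1))  # prints 96.5 (GB)
```

Even this toy calculation shows why concurrency, not named-user count, is what drives the memory budget: doubling the concurrent users roughly doubles the PGA demand while the SGA stays put.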