Sunday, April 7, 2013

Online Analytical Processing (OLAP) Concepts

  • Online Analytical Processing OLAP: is an approach to swiftly answer multi-dimensional analytical (MDA) queries.
    • Broader category of business intelligence, which encompasses relational reporting and data mining.
    • Databases configured for OLAP use a multidimensional data model, allowing complex analytical and ad-hoc queries with a rapid execution time.
      • Ad-hoc ("for this"): allows users to create specific, customized queries.
    • OLAP tools enable users to interactively analyze multidimensional data from multiple perspectives.
    • The core of any OLAP system is an OLAP cube, which consists of numeric facts called measures  (derived from the record in the fact table) which are categorized by dimensions (derived from the dimension tables). The cube metadata is typically created from a star-schema or snowflake schema of tables in a relational database.
      • Consists of 3 analytical operations: consolidation, drill-down, slicing and dicing.
        • Consolidation: involves the aggregation of data that can be accumulated and coputed in one or more dimensions. i.e. all sales offices are rolled up to the Sales Department.
        • Drill-Down: allows users to navigate through the details.
        • Slicing and dicing: feature whereby users can take out (slicing) a specific set of data of the cube and view (dicing) the slices from different viewpoints.
    • Multidimensional (MOLAP): is the classic form of OLAP and is sometimes referred to as just OLAP.
      • Stores data in an optimized multi-dimensional array storage, rather than in a relational database.
      • Requires the pre-computation and storage of information in the cube - the operation known as processing.
    • Relational (ROLAP): works directly with relational databases. The base data and the dimension tables are stored as relational tables and new tables are created to hold the aggregated information.
      • Each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
    • Hybrid (HOLAP): there's no clear agreement that constitutes HOLAP, except that a database will divide data between relational and specialized storages.
      • i.e. some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed data, and use specialized storage for at least some aspects of the smaller quntities of more aggregate or less-detailed data.
    • Other types of OLAP:
      • WOLAP: Web-based OLAP
      • DOLAP: Desktop OLAP
      • RTOLAP: Real-Time OLAP
  • Top Vendors:
    • Microsoft Corp.
    • Oracle - Hyperion Solutions Corp.
    • IBM - Cognos
    • Business Objects
    • MicroStrategy
    • SAP AG
    • Applix
    • Infor

Data Warehousing concepts: Kimball vs. Inmon vs. Hybrid vs. Vault

  • Ralph Kimball - bottom-up design: approach data marts are first created to provide reporting and analytical capabilities for specific business processes.
    • Data marts - contain, primarily, dimensions and facts.
      • Facts - either contain atomic (detailed) data, and, if necessary, summarized data.
      • Single data mart often models a specific business area (unit) i.e. "Sales," "Production."
    • Integration of data marts is managed through implementation is called: "data warehouse bus architecture."
    • Data Warehouse bus architecture: is an implementation of "the bus," a collection of conformed dimensions and conformed facts, which are dimensions that are shared (in a specific way) between facts in two or more data marts.
      • Conformed dimensions - is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation.
      • Integration Points(between data marts) - integration of the data marts in the data warehouse is centered on the conformed dimensions (residing in "the bus").
      • Drill across process - integration of two or more data marts process.
        • Works by grouping (summarizing) the data long the keys of the (shared) conformed dimensions of each fact participating in the "drill across" followed by a join on the keys of these grouped (summarized) facts.
      • Maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse.
      • Important management task is making sure dimensions among data marts are consistent or "conform."
    • Advantage: data warehouse ends up being "segmented." into number of logically self contained (up and including The Bus) and consistent data marts, rather than a big and often complex centralized model.
      • Business value can be returned as quickly as the first data marts can be created.
      • i.e. data warehouse can start from "Sales" department, by building a Sales-data mart. And if business is expanded into "Production" department, then Production-data mart can be integrable, because they share the same "BUS."
      • If integration via the bus is achieved, the data warehouse, through its two data marts, will only be able to deliver the specific information that the individual data marts are designed to do, but integrated "Sales-Production" information, which, often is of critical business value.
  • Bill Inmon - top-down design: 1st author on the subject of data warehouse, as a centralized repository for the entire enterprise. Data warehouse design using normalized enterprise data model.
    • Inmon's vision the data warehouse is at the center of the "Corporate Information Factory (CIF)," which provides a logical framework for delivering business intelligence (BI) and business management capabilities.
    • Atomic data, data at the lowest level of detail, are stored in the data warehouse.
    • Dimensional data marts: are created containing data needed for specific business processes or department from the data warehouse.
    • Data warehouse is:
      • Subject oriented: the data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
      • Non-volitile: data in the data warehouse are never over-written or deleted - once committed, the data are static (do not change), read-only, and retained for future reporting.
      • Integrated: data warehouse contains data from most or all of an organization's operational systems and these data are made consistent.
      • Time variant: top-down design methodology generates highly consistent dimensional views of data across data marts, since all data marts are loaded from the centralized repository.
        • Top-down: design is robust against business changes.
        • Generating new dimensional data marts against the data stores in the data warehouse is a relatively simple task.
    • Disadvantage: top-down  design represents a very large project with a very broad scope.
      • Upfront cost for implementing a data warehouse is significant, and the duration of time from the start of project to he point that end users experience initial benefits can be substantial.
      • Inflexible and unresponsive to changing departmental needs during the implementation phases.
  • Hybrid design: data warehouse solutions often resemble hub and spoke architecture. Legacy systems feeding the DW/BI solution often include CRM and ERP, generating large amounts of data. To consolidate these various data models, and facilitate the ETL process, DW solutions often make use of an operational data store (ODS). The information then parsed into the actual DW.
    • Operational Data Store: is  database designed to integrate data from multiple sources for additional operations on the data)
    • To reduce redundancy, large systems will often store data in a normalized way. Data marts for specific reports can then be built on top of the DW solution.
    • Important to note that DW database in a hybrid solution is kept on 3d normal form to eliminate data redundancy. Although normal relational database is not efficient for BI reports.
    • DW effectively provides a single source of information from which the data marts can read, creating a highly flexible solutions from a BI point of view.
    • Hybrid architecture: allows DW to be replaced with a master data management solution where operational, not static information could reside.
      • Master data management: comprises a set of processes and tools that consistently defines and manages the non-transactional data entities of an organization (which may include reference data).
  • Data Vault Modeling: is a hybrid design, consisting of the best of breed practices from both 3rd normal form and star-schema.
    • Data Vault model is not a true 3rd normal form, and breaks some of the rules that 3NF dictates be followed.
    • It is a top-down architecture with bottom-up design, geared to be strictly a data warehouse.
    • It is not geared to be end-user accessible, which when built, still requires the user of a data mart or star-schema based release are for business purposes.

Logical vs. Physical Dimensional Data Models

  • Logical Data Model: suits business requirements, creates and maintains the lookup data, compares the versions of data model, maintains change log, and generate reports from data model.
    • Logical Data Model:
      • Represents business information and defines business rules
      • Entity
      • Attribute
      • Primary Key
      • Alternate Key
      • Inversion Key Entry
      • Rule
      • Relationship
      • Definition
  • Physical Data Model: defines the source and target database properties. Should know the technical-know-how to create data models from existing databases and to tune the data models with referential integrity, alternate keys, indexes, and how to match indexes to SQL code. It would be good if the physical data modeler knows about replication, clustering, and so on.
    • Physical Data Model:
      • Represents the physical implementation of the model in a database.
      • Table
      • Column
      • Primary Key Constraint
      • Unique Constraint or Unique Index
      • Non Unique Index
      • Check Constraint, Default Value
      • Foreign Key
      • Comment

Feature
Conceptual
Logical
Physical
Entity Names

Entity Relationships

Attributes


Primary Keys

Foreign Keys

Table Names


Column Names


Column Data Types




Saturday, April 6, 2013

Logical vs. Physical Dimensional Data Models

Logical Dimensional Data Models
Logical data models address the following areas of interest:
1. Validating application models against business requirements
2. Creating requirements for physical data models and database design
3. Identifying business entities and the relationships between the entities

Logical data models create a single view of all data. You can create a logical data model to address performance, consistency, and redundancies in your data. You use the logical data model to create a physical data model that accesses the data.

Physical Dimensional Data Models
Keep in mind the following considerations when you create a physical data model:
1. How scalable is your design? How scalable is the physical database management system (DBMS)?
2. What queries, ETL processes, and other applications does the data warehouse require?
3. Is there an abstracted data model that you can use to improve performance?
4. How will you operate or maintain the data warehouse?