Forgot Password?

Introduction to Data Warehousing

 

Introduction

Over the last 20 years, $1 trillion has been invested in new computer systems to gain competitive advantage. The vast majority of these systems have automated business processes, to make them faster, cheaper, and more responsive to the customer. Electronic point of sales (EPOS) at supermarkets, itemized billing at telecommunication companies (telcos), and mass market mailing at catalog companies are some examples of such “Operational Systems”. These systems computerized the day-to-day operations of business organizations. Some characteristics of the operational systems are as follows:

Presently almost all businesses have operational systems and these systems are not giving them any competitive advantage. These systems have gathered a vast amount of “data” over the years. The companies are now realizing the importance of this “hidden treasure” of information. Efforts are now on to tap into this information that will improve the quality of their decision-making.
A “data warehouse” is nothing but a repository of data collected from the various operational systems of an organization. This data is then comprehensively analyzed to gain competitive advantage. The analysis is basically used in decision making at the top level.
From being just a passing fad, Data Warehousing technology has grown much in scale and reputation in the past few years, as evidenced by the increasing number of products, vendors, organizations, and yes books, even books, devoted to the subject. Enterprises that have successfully implemented data warehouses find it strategic and often wonder how they ever managed to survive without it in the past.
As early as 1995, a Gartner Group survey of Fortune 500 IT managers found that 90% of all organizations had planned to implement Data Warehouses by 1998.

 

Data Warehousing Systems

A data warehousing system can perform advanced analyses of operational data without impacting operational systems. OLTP is very fast and efficient at recording the business transactions - not so good at providing answers to high-level strategic questions.

Component Systems

Legacy Systems

Any information system currently in use that was built using previous technology generations.  Most legacy Systems are operational in nature, largely because the automation of transaction-oriented business process had long been the priority of IT projects.

Source Systems

Any system from which data is taken for a data warehouse. A source system is often called a legacy system in a mainframe environment.

Operational Data Stores (ODS)

An ODS is a collection of integrated databases designed to support the monitoring of operations. Unlike the databases of OLTP applications (that are function oriented), the ODS contains subject oriented, volatile, and current enterprise-wide detailed information. It serves as a system of record that provides comprehensive views of data in operational sources.
Like data warehouses, ODSs are integrated and subject-oriented. However, an ODS is always current and is constantly updated. The ODS is an ideal data source for a data warehouse, since it already contains integrated operational data as of a given point in time.
In short, ODS is an integrated collection of clean data destined for the data warehouse.

 

Definition

Data Warehouses are mostly populated with periodic migrations of data from operational systems. The second source is made up of external, frequently purchased, databases. Examples of this data would include lists of income and demographic information. This purchased information is linked with internal data about customers to develop a good customer profile.

A Data Warehouse is a

collection of data in support of management decisions.

Subject Oriented

OLTP databases usually hold information about small subsets of the organization. For example, a retailer might have separate order entry systems and databases for retail, catalog, and outlet sales. Each system will support queries about the information it captures. But if somebody wants to find out details of all sales, then these “separate” systems are not adequate. To address this type of situation, your data warehouse database should be subject-oriented, organized into subject areas like sales, rather than around OLTP data sources.

Flowchart: Magnetic Disk: Retail Sales System

                                                                                           

A data warehouse is organized around major subjects such as customer, products, sales, etc. Data are organized according to subject instead of application. For exmple, an insurance company using a data warehouse would organize their data by customer, premium, and claim instead of by different products (auto, life, property etc.).

Integrated

A data warehouse is usually constructed by integrating multiple, heterogeneous sources, such as relational databases, flat files, and OLTP files. When data resides in many separate applications in the operational environment, the encoding of data is often inconsistent. For example, in the above system, the retail system uses a numeric 7-digit code for products, the outlet system code consists of 9 alpha-numerics, and the catalog system uses 4 alphabets and 4 numerics. To create a useful subject area, the source data must be integrated. There is no need to change the coding in these systems, but there must be some mechanism to modify the data coming into the data warehouse and assign a common coding scheme.

 

Nonvolatile

Unlike operational databases, warehouses primarily support reporting, not data capture. A data warehouse is always a physically separate store of data. Due to this separation, data warehouses do not require transaction processing, recovery, concurrency control etc. The data are not updated or changed in any way once they enter the data warehouse, but are only loaded, refreshed and accessed for queries.

Time Variant
Data are stored in a data warehouse to provide historical perspective. Every key structure in the data warehouse contains, implicitly or explicitly, an element of time. A data warehouse generally stores data that is 5-10 years old, to be used for comparisons, trends, and forecasting.

 

Operational Systems vs Data Warehousing Systems


Operational

Data Warehouse

Holds current data

Holds historic data

Data is dynamic

Data is largely static

Read/Write accesses

Read only accesses

Repetitive processing

Adhoc complex queries

Transaction driven

Analysis driven

Application oriented

Subject oriented

Used by clerical staff for day-to-day operations

Used by top managers for analysis

Normalized data model (ER model)

Denormalized data model (Dimensional model)

Must be optimized for writes and small queries.

Must be optimized for queries involving a large portion of the warehouse.

 

 

Advantages of Data Warehousing

Problems with Data Warehousing

Data Warehouse Architecture*

A typical data warehousing architecture is illustrated below:

                                                                                           

 

DATA WAREHOUSE COMPONENTS & ARCHITECTURE

The data in a data warehouse comes from operational systems of the organization as well as from other external sources. These are collectively referred to as source systems.  The data extracted from source systems is stored in a area called data staging area, where the data is cleaned, transformed, combined, deduplicated to prepare the data for us in the data warehouse. The data staging area is generally a collection of machines where simple activities like sorting and sequential processing takes place. The data staging area does not provide any query or presentation services. As soon as a system provides query or presentation services, it is categorized as a presentation server. A presentation server is the target machine on which the data is loaded from the data staging area organized and stored for direct querying by end users, report writers and other applications. The three different kinds of systems that are required for a data warehouse are:

  1. Source Systems
  2. Data Staging Area
  3. Presentation servers

 

The data travels from source systems to presentation servers via the data staging area. The entire process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer). Oracle’s ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server’s ETL tool is called Data Transformation Services (DTS).
A typical architecture of a data warehouse is shown below:

  

Each component and the tasks performed by them are explained below:
           

The sources of data for the data warehouse is supplied from:

The load manager performs all the operations associated with extraction and loading data into the data warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse. The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom built programs.

The warehouse manager performs all the operations associated with the management of data in the warehouse. This component is built using vendor data management tools and custom built programs. The operations performed by warehouse manager include:

In certain situations, the warehouse manager also generates query profiles to determine which indexes ands aggregations are appropriate.

The query manager performs all operations associated with management of user queries. This component is usually constructed using vendor end-user access tools, data warehousing monitoring tools, database facilities and custom built programs. The complexity of a query manager is determined by facilities provided by the end-user access tools and database.

This area of the warehouse stores all the detailed data in the database schema. In most cases detailed data is not stored online but aggregated to the next level of details. However the detailed data is added regularly to the warehouse to supplement the aggregated data.

The area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient as it will be subject to change on an ongoing basis in order to respond to the changing query profiles. The purpose of the summarized information is to speed up the query performance. The summarized data is updated continuously as new data is loaded into the warehouse.

This area of the warehouse stores detailed and summarized data for the purpose of archiving and back up. The data is transferred to storage archives such as magnetic tapes or optical disks.

The data warehouse also stores all the Meta data (data about data) definitions used by all processes in the warehouse. It is used for variety of purposed including:

The structure of Meta data will differ in each process, because the purpose is different. More about Meta data will be discussed in the later Lecture Notes.

The principal purpose of data warehouse is to provide information to the business managers for strategic decision-making. These users interact with the warehouse using end user access tools. The examples of some of the end user access tools can be:

THE E T L (EXTRACT TRANSFORMATION LOAD) PROCESS
In this section we will discussed about the 4 major process of the data warehouse. They are extract (data from the operational systems and bring it to the data warehouse),transform(the data into internal format and structure of the data warehouse),cleanse (to make sure it is of sufficient quality to be used for decision making) and load (cleanse data is put into the data warehouse).
The four processes from extraction through loading often referred collectively as Data Staging.

EXTRACT

Some of the data elements in the operational database can be reasonably be expected to be useful in the decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the relevant data from the operational database before bringing into the data warehouse. Many commercial tools are available to help with the extraction process. Data Junction is one of the commercial products. The user of one of these tools typically has an easy-to-use windowed interface by which to specify the following:

 

TRANSFORM
The operational databases developed can be based on any set of priorities, which keeps changing with the requirements. Therefore those who develop data warehouse based on these databases are typically faced with inconsistency among their data sources. Transformation process deals with rectifying any inconsistency  (if any).
One of the most common transformation issues is ‘Attribute Naming Inconsistency’. It is common for the given data element to be referred to by different data names in different databases. Employee Name may be EMP_NAME in one database, ENAME in the other. Thus one set of Data Names are picked and used consistently in the data warehouse. Once all the data elements have right names, they must be converted to common formats. The conversion may encompass the following:

All these transformation activities are automated and many commercial products are available to perform the tasks. DataMAPPER from Applied Database Technologies is one such comprehensive tool.

CLEANSING

Information quality is the key consideration in determining the value of the information. The developer of the data warehouse is not usually in a position to change the quality of its underlying historic data, though a data warehousing project can put spotlight on the data quality issues and lead to improvements for the future. It is, therefore, usually necessary to go through the data entered into the data warehouse and make it as error free as possible. This process is known as Data Cleansing.
Data Cleansing must deal with many types of possible errors. These include missing data and incorrect data at one source; inconsistent data and conflicting data when two or more source are involved. There are several algorithms followed to clean the data, which will be discussed in the coming lecture notes.

LOADING

Loading often implies physical movement of the data from the computer(s) storing the source database(s) to that which will store the data warehouse database, assuming it is different. This takes place immediately after the extraction phase. The most common channel for data movement is a high-speed communication link. Ex: Oracle Warehouse Builder is the API from Oracle, which provides the features to perform the ETL task on Oracle Data Warehouse.

 

Data Warehouse Design

An introduction to Dimensional Modeling

Data Warehouses are not easy to build. Their design requires a way of thinking that is just opposite to manner in which traditional computer systems are developed. Their construction requires radical restructuring of vast amounts of data, often of dubious or inconsistent quality, drawn from numerous heterogeneous sources. Their implementation strains the limits of today’s IT. Not surprisingly, a large number of data warehouse projects fail. Successful data warehouses are built for just one reason: to answer business questions. The type of questions to be addressed will vary, but the intention is always the same. Projects that deliver new and relevant information succeed. Projects that do no, fail. [6]


To deliver answers to businesspeople, one must understand their questions. The DW design fuses business knowledge and technology know-how. The design of the data warehouse will mean the difference between success and failure.
The design of the data warehouse requires a deep understanding of the business. Yet the task of design is undertaken by IT professionals, but not business decision makers. Is it reasonable to expect the project to succeed? The answer is yes. The key is learning to apply technology toward business objectives.
Most computer systems are designed to capture data, data warehouses are designed to for getting data out. This fundamental difference suggests that the data warehouse should be designed according to a different set of principles.
Dimensional Modeling is the name of a logical design technique often used for data warehouses. It is different from entity-relationship modeling. ER modeling is very useful for transaction capture in OLTP systems.
Dimensional Modeling is the only viable technique for delivering data to the end users in a data warehouse.

Comparison between ER and Dimensional Modeling
The characteristics of ER Model are well understood; its ability to support operational processes is its underlying characteristic.  The conventional ER models are constituted to

In contrast, the dimensional model is designed to support the reporting and analytical needs of a data warehouse system.

Why ER is not suitable for Data Warehouses?

Introduction to Dimensional Modeling Concepts

The objective of dimensional modeling is to represent a set of business measurements in a standard framework that is easily understandable by end users. A Dimensional model contains the same information as an ER model but packages the data in a symmetric format whose design goals are

The main components of a Dimensional Model are Fact Tables and Dimension Tables. A fact table is the primary table in each dimensional model that is meant to contain measurements of the business. The most useful facts are numeric and additive. Every fact table represents a many to many relationship and every fact table contains a set of two or more foreign keys that join to their respective dimension tables.

A fact depends on many factors. For example, sale_amount, a fact, depends on product, location and time. These factors are known as dimensions. Dimensions are factors on which a given fact depends. The sale_amount fact can also be thought of as a function of three variables.
sales_amount = f(product, location, time)
Likewise in a sales fact table we may include other facts like sales_unit and cost.
Dimension tables are companion tables to a fact table in a star schema. Each dimension table is defined by it’s primary key that serves as the basis for referential integrity with any given fact table to which it is joined. Most dimension tables contain textual information.
To understand the concepts of facts, dimension, and star schema, let us consider the following scenario:
Imagine standing in the marketplace and watching the products being sold and writing down the quantity sold and the sales amount each day for each product in each store. Note that a measurement needs to be taken at every intersection of all dimensions (day, product, and store). The information gathered can be stored in the following fact table:

 

The facts are Sale_Unit, Sale_Amount, and Cost (note that all are numeric and additive), which depend on dimensions Date, Product, and Store. The details of the dimensions are stored in dimension tables.

Note the following points about the star schema:

The facts in a star schema are of the following three types:

The facts in the above schema are fully-additive.

 

Designing a Dimensional Model: Steps Involved

Step 1 - Select the Business Process
The first step in the design is to decide what business process (es) to model by combining an understanding of the business requirements with an understanding of the available data [8]

Step 2 - Declare the Grain
Once the business process has been identified, the data warehouse team faces a serious decision about the granularity. What level of detail must be made available in the dimensional model?  The grain of a fact table represents the level of detail of information in a fact table. Declaring the grain means specifying exactly what an individual fact table record represents. It is recommended that the most atomic information captured by a business process. Atomic data is the most detailed information collected. The more detailed and atomic the fact measurements are, the more we know and we can analyze the data better. In the star schema discussed above, the most detailed data would be transaction line item detail in the sale receipt.
(date, time, product code, product name, price/unit, number of units, amount)
18-SEP-2002, 11.02, p1, dettol soap, 15, 2, 30
But in the above dimensional model we provide sales data rolled up by product(all records corresponding to the same product are combined) in a store on a day. A typical fact table record would look like this:
18-SEP-2002, Product1, Store1, 150, 600
This record tells us that on 18th Sept. 150 units of Product1 was sold for Rs. 600 from Store1. [9]

Step 3 – Choose the Dimensions
Once the grain of the fact table has been chosen, the date, product, and store dimensions are readily identified. It is often possible to add more dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only one value under each combination of the primary dimensions. If the additional dimension violates the grain by causing additional fact rows to be generated, then the grain must be revised to accommodate this dimension.

Step 4 – Identify the Facts
The first step in identifying fact tables is where we examine the business, and identify the transaction that may be of interest. In our example the electronic point of sale (EPOS) transactions give us two facts, quantity sold and sale amount.

 

Strengths of Dimensional Modeling
The dimensional model has a number of important data warehouse advantages that the ER model lacks[9]. Its strengths are:

The following graceful can be made to the design after the data warehouse is up and running:

Details about the above modeling situations to be provided in later article.

 

Snowflake and Starflake Schemas

In dimensional modeling the dimension tables are in denormalized form whereas fact tables are in normalized form.
Snowflaking is removing low cardinality (an attribute not having low distinct values to table cardinality ratio) textual attributes from dimension tables and placing them in secondary dimension tables. For instance, a product category can be treated this way and physically removed from the low-level product dimension table by normalizing the dimension table. This is particularly done on large dimension tables. Snowflaking a dimension means normalizing it and making it more manageable by reducing its size. But this may have an adverse effect on performance, as joins need to be performed.
If all the dimensions in a star schema are normalized, the resulting schema is called a snowflake schema and if only a few dimensions are normalized, we call it a starflake schema.

Multidimensional Databases and MOLAP

Business process is multidimensional in the sense that mangers ask questions about product sales in different regions over specific time periods.

Dimensions: Product, Region, Time period
Fact or Measure: Sale

 

An MDDB is a computer software system designed to allow for the efficient and convenient [10] storage and retrieval system of large volumes of data that is

  1. Intimately related &
  2. Stored, viewed and analyzed form different perspectives

These perspectives are called Dimensions.

A Motivating Example

An automobile manufacturer wants to increase sale volumes by examining sales data collected throughout the organization. The evaluation would require viewing historical sales volume figures from multiple dimensions such as

Analyzing sales volumes data from any one or more of the above dimensions can give answers to important queries such as:

What is the trend in sales volumes over a period of time for a specific model and color across a specific group of dealerships?

Consider the relation given below containing the manufacturer’s sales data:

 

SALES VOLUMES FOR GLEASON DEALERSHIP

 

The above matrix is a 2-D array. An array is a fundamental component of MDDBs.
In an array, each axis is called a dimension (MODEL & COLOR)
Each element in the dimension is called a position.
For model, 3 positions, van, sedan, and coupe.
For color, 3 positions, blue, white, and red.

Intersections of dimensions are called cells and are populated with the data of interest or measure or fact (sales).

Sedan sales are all “lined up” color-wise. Total sedan sales can be added very quickly. Similarly sales for each color are also lined up.

The relational structure tells us nothing about the possible contents of those fields

 

Increasingly Complex Relational Tables [11]
If we add a new field, dealers to the relational table, with three possible values, the relational table becomes even more awkward for presenting data to the end user.

SALES VOLUMES FOR ALL DEALERSHIPS

            MODEL                      COLOR                      DEALERSHIP            VOLUME
MINI VAN                  BLUE                          CLYDE                        6
MINI VAN                  BLUE                          GLEASON                   6
MINI VAN                  BLUE                          CARR                          2
MINI VAN                  RED                             CLYDE                        3
MINI VAN                  RED                             GLEASON                   5
MINI VAN                  RED                             CARR                          5
MINI VAN                  WHITE                        CLYDE                        2
MINI VAN                  WHITE                        GLEASON                   4
MINI VAN                  WHITE                        CARR                          3
SPORTS COUPE         BLUE                          CLYDE                        2
SPORTS COUPE         BLUE                          GLEASON                   3
SPORTS COUPE         BLUE                          CARR                          2
SPORTS COUPE         RED                             CLYDE                        7
SPORTS COUPE         RED                             GLEASON                   5
SPORTS COUPE         RED                             CARR                          2
SPORTS COUPE         WHITE                        CLYDE                        4
SPORTS COUPE         WHITE                        GLEASON                   5
SPORTS COUPE         WHITE                        CARR                          1
SEDAN                        BLUE                          CLYDE                        6
SEDAN                        BLUE                          GLEASON                   4
SEDAN                        BLUE                          CARR                          2
SEDAN                        RED                             CLYDE                        1
SEDAN                        RED                             GLEASON                   3
SEDAN                        RED                             CARR                          4
SEDAN                        WHITE                        CLYDE                        2
SEDAN                        WHITE                        GLEASON                   2
SEDAN                        WHITE                        CARR                          3

 

 

Multidimensional Simplification [9]
We just need to add a third axis or dimension called Dealers. The array now becomes 3-D  (3x3x3 with 27 cells). Earlier it was 2-D (3x3 with 9 cells). The array can now be thought of as a cube with 3 faces, with each face having 9 cells.
If we have a 10x10x10 array, with each of the three dimensions having 10 positions. In relational format, we will need 1000 records to represent this array.

 

Performance Advantages
Consider a 10x10x10 array. A user wants to find out the sales figure for blue colored sedan sold by Gleason dealer.  A relational system might have to search through all 1000 records just to find the qualifying records. The multidimensional system has to search only along three dimensions of 10 positions each to find the matching record. This is a maximum of 30 position searches for the array versus 1000 record searches for the table

Adding Dimensions
The 3D model can easily be extended to four dimensions by adding time dimension to indicate the month of the year in which sale was made.

 

Trade-Offs: MDDB vs. RDBMS [13]
Consider the following factors when choosing between the multidimensional approaches:

Recently, many of the large database vendors have announced plans to integrate their multidimensional and relational database products. In this situation, the end-users make use of the multidimensional front-end tools for all their queries. If the query requires data that are not available in MDDB, the tool will retrieve the required data from the larger relational database. This feature is called as drill-through.

The following table sums up the comparison between MDDBs and RDBMSs

MDDB

RDBMS

Data is stored in multidimensional arrays

Data is stored in relations

Direct inspection of an array gives a great deal of information

Not so

Can handle limited size databases (< 100GB)

Proven track record for handling VLDBs

Takes long to load and update

Highly volatile data are better handled

Support aggregations better

RDBMSs are catching up-Aggregate Navigators

New investments need to be made and new skill sets need to be developed

Most enterprises already made significant investments in RDBMS technology and skill sets

Adds complexity to the overall system architecture

No additional complexity

Limited no. of facts an dimensional tables

No such restriction

Examples

  • Arbor-Essbase
  • Brio Query-Enterprise
  • Dimensional Insight-DI Diver
  • Oracle-Express Server

Examples

  • IBM-DB2
  • Microsoft-SQL Server
  • Oracle-Oracle RDBMS
  • Red Brick Systems-Red Brick Warehouse
  • more about aggregation and aggregate navigation later in the course

Conclusion:

Data warehousing and Data Mining are two important components of business intelligence. Data warehousing is necessary to analyze (Analysis) the business needs, integrate (Integration) data from several sources, model (Data Modeling) the data in an appropriate manner to present the business information in the form of dashboards and reports (Reporting).

Reference: (Compiled from the following)

  1. Corey M et al., Oracle 8i Data Warehousing, TMH 2001
  2. Connolly T and Carolyn B, Database Systems, second edition, AW, 1998
  3. Ramakrishna R and Gehrke J, Database Management Systems, second edition, MGH, 2000
  4. http://cisnet.baruch.cuny.edu/holowczak/classes/9440/datawarehousing/
  5. http://system-services.com/ftp/dwintro.doc
  6. R. Kimball & M Ross, The Data Warehouse Toolkit, 2e, John Wiley, 2002.
  7. J.M. Firestone, Dimensional modeling and ER modeling in the Data Warehouse, http://www.dkms.com/DMERDW.html
  8. N Raden, Modeling the Data Warehouse,http://members.aol.com/nraden
  9. R. Kimball, A dimensional Modeling Manifesto, http://www.dbmsmag.com/9708d15.html
  10. Kenan Technologies, An introduction to Multidimensional Database Technology http://www. kenan.com
  11. M Humphries et al, Data Warehousing:  Architecture and Implementation, Prentice Hall PTR, 1999.
  12. S Anahory & D Murray, Data Warehousing in the Real World, Appendix B: Data Analysis Tools, pp 320-22.
  13. T B Pedersen & C S Jensen, Multidimensional Database Technology, Computer (IEEE), 2001, pp 40-46.