Data Warehousing Concepts: What That Means and How it Will Help You Succeed

DatabaseIf you’re at all familiar with relational database management systems (RDBM’s) then you understand how important using one of these systems, or a data warehouse, can be. Data collection and analysis are extremely vital to the success of your business, organization or project. A data warehouse is a RDBM, but it’s designed for reporting and analysis as opposed to transactions. It separates these two functions which allows for maximum organization and understanding of the data collected.  Data warehousing concepts will help you understand how to organize, filter, and make the most out of your data. For more about building a useful database, or how to operate your existing database, check out this introduction to database design course.

Database Warehousing

The definition as made by Bill Inmon, a computer scientist who is dubbed as being the father of the data warehouse, is that database warehousing is a subject-oriented, integrated, non-volatile and integrated system that is used for data analysis. Let’s break this down.

Subject Oriented

A data warehouse is a sort of data repository that allows data from say, your online retail business, to be collected and then sorted. We can sort through this data and summarize our findings by subject. I.e. How many blue V-neck T-shirts did we sell last March? This kind of question can be found because of the subject oriented nature of a data warehouse.

Integrated

All of the data collected, regardless of the source, is integrated in a data warehouse. So, if you have data from one website and data from another website, this data is put together in the data warehouse so you can sort through it all.

Time-Variant

This is your record keeping feature. Because a data warehouse is not transaction based (which usually only holds the most recent data collected) you can go back a year, two years, one month or three months – you have access to these records and can analyze them until your heart’s content.

Non-volatile

Simply put, whatever data is entered into the warehouse, won’t change. It will never be different than it was when it was received by the warehouse and it won’t go away unless manually deleted

Data warehousing concepts

Dimensional Data Model

This is one of, if not the, most common data model used in data warehousing. It’s different than the 3rd normal model as data is stored differently and isn’t used for transactional type systems. A dimension is a category of information and an attribute is a unique level within a dimensions. The hierarchy specifies the levels within the dimension, so, like Month > Day > Hour. A dimensional model is made up of fact tables and lookup tables. A fact table harbors measurable data, like, the price of a product. The lookup table provides detailed information about the attributes. Fact tables are connected to one or more lookup tables but they do not have relationship with other fact tables.

Physical Data Model

This model includes all table structures, so the column name, data type and constraints that will be built in addition to foreign and primary keys. The steps to build a physical data model includes converting entities into tables, converting relationships into foreign keys (FK), converting attributes into columns and modifying the model based on physical requirements. The features that make the physical data model what it is are things like foreign keys (used to identify relationships) and specification of all the tables and columns within the data warehouse. This data model will be different between different RDBM’s.

Logical Data Model

The logical data model differs from the physical data model in that the entity names are now table names, attributes are now columns, and the data type for each column is specified. It’s a more detailed approach to data warehousing and is more complex because of it. The logical data model includes all entities and the relationships between them, all attributes are specified, the primary key (PK) for each entity is identified and normalization occurs, whereas denormalization occurs at the physical data model level.

Data Integrity

This is basically a fancy way of saying the “validity of data”. It just means that data is consistent and accurate. I mean, what good would your data be if it were not time-variant and volatile? Your reports would be off and so would your analysis. You can enforce data integrity in the data warehouse by using referential integrity, the primary key, valid values and by understanding NULL. Referential integrity maintains the relationship between the primary key of one table and the foreign key of another. It’s a checks and balances system. The primary key and the unique constraint are used to make sure that all data is unique in a table (so, no duplicate entries). Valid values just means that only data that is permitted is collected and analyzed. You have control over what is or is not permitted. NULL just means that when you identify columns as NOT NULL, they can’t have a NULL value. Pretty straight forward.

OLAP

On-Line Analytical Processing (OLAP), is a programming approach to totally analyze your information from many points of views. In fact, there is something called the FASMI test which stands for Fast Analysis of Shared Multidimensional Information which can be really useful for your business. Using this, you can specifically answer questions like; what product is the most popular? Which product is most likely to rise in what region? For marketing purposes, OLAP is a great tool. OLAP can be broken down into MOLAP, ROLAP and HOLAP.

MOLAP, ROLAP, HOLAP

So, it was mentioned above that OLAP has a multidimensional approach (MOLAP) to analyzing information, right? Well it also has a relational approach (ROLAP) and a hybrid approach (HOLAP), the latter of which combines MOLAP and ROLAP. In MOLAP technology, information is stored in multidimensional cubes which allows for fast data retrieval. In ROLAP, the processing is a bit slower because each search is essentially a SQL query. (Not sure what an SQL query is? That’s okay! Learn what you should know about structured query languages in SQL Server Essentials.) The benefit of ROLAP, though, is its ability to hold massive amounts of information in a relational database that functions similar to standard OLAP processing. It showcases the familiar slicing and dicing functionality – kind of like adding the WHERE clause in an SQL statement. MOLAP doesn’t have unlimited data holding capabilities because all of the calculations are performed when the multidimensional cube is built making it non-expansive. HOLAP processing offers the speed of MOLAP with the unlimited data features of ROLAP making it the best of both worlds.

Junk dimension

By using junk dimension concepts we can combine indicator fields into a single dimension. It simplifies things by adding a junk dimension which holds several indicator fields that creates its own fact table. For example, if we have a fact table that has 5 dimensions, three of which are indicators. We can replace the three indicators with a dimension called JUNK_ID. When we open the junk dimension, the three indicators we replaced would be within, like below.

      Fact Table

CUSTOMER_ID

PRODUCT_ID

PREPAY_IND

TAX_IND

COUPON_IND

      Fact Table

CUSTOMER_ID

PRODUCT_ID

JUNK_ID

DIM_JUNK

JUNK_ID

PREPAY_IND

TAX_IND

COUPON_IND

1

1

1

1

2

2

2

2

3

3

3

3

Data warehousing may seem complex, but with a little bit of studying and easy online tutorials like Working with Big Data you can learn how to make the most out of the data you’re collecting.