Thursday, February 25, 2010

Data Warehouse Design & ETL Concepts

Data Warehouse Design
Data warehouses and data marts are built on dimensional data modeling where fact tables are connected with dimension tables. This is most useful for users to access data since a database can be visualized as a cube of several dimensions.



Dimensional Data Modeling
Dimensional Data Modeling comprises of one or more dimension tables and fact tables.
Dimensional Data Modeling is used for calculating summarized data. For example, sales data could be collected on a daily basis and then be aggregated to the week level, the week data could be aggregated to the month level, and so on.



Dimension Tables
Dimension table is one that describes the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables.
For example, Product dimension table will store information about products(Product Category, Product Sub Category, Product and Product Features) and location dimension table will store information about location( country, state, county, city, zip.



Fact Tables
The centralized table in a star schema is called as FACT(measures) table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
For example Sales Amount can be the measure for any business.



Measure Types
Additive - Measures that can be added across all dimensions.
Non Additive - Measures that cannot be added across all dimensions.
Semi Additive - Measures that can be added across few dimensions and not with others.

Star Schema
Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table and it points towards the dimension tables. The advantage of star schema is slicing down, performance increase and easy understanding of data.





Snowflake Schema
A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables. Dimension table hierarchies are broken into simpler tables.






ETL Concept
Extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into target database.
The first step in ETL process is mapping the data between source systems and target database(data warehouse or data mart). The second step is cleansing of source data in staging area. The third step is transforming cleansed source data and then loading into the target system.
Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation, move) are sometimes used instead of ETL.



Glossary of ETL
Source System
A database, application, file, or other storage facility from which the data in a data warehouse is derived.
Mapping
The definition of the relationship and data flow between source and target objects.
Metadata
Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.
Staging Area
A place where data is processed before entering the warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to a data warehouse.
Target System
A database, application, file, or other storage facility to which the "transformed source data" is loaded in a data warehouse.

1 comment:

  1. Wonderful bloggers like yourself who would positively reply encouraged me to be more open and engaging in commenting. So know it's helpful..

    Informatica Training in Chennai

    Dataware Housing Training in Chennai

    ReplyDelete