Foundation of Data Science: Unit I: Introduction

Data Warehousing

Characteristics, Multitier Architecture, Needs, Benefits, Metadata

Data warehousing is the process of constructing and using a data warehouse.

Data Warehousing

• Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries and decision making. Data warehousing involves data cleaning, data integration and data consolidations.

• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. A data warehouse stores historical data for purposes of decision support.

• A database an application-oriented collection of data that is organized, structured, coherent, with minimum and controlled redundancy, which may be accessed by several users in due time.

• Data warehousing provides architectures and tools for business executives to systematically organize, understand and use their data to make strategic decisions.

• A data warehouse is a subject-oriented collection of data that is integrated, time-variant, non-volatile, which may be used to support the decision-making process.

• Data warehouses are databases that store and maintain analytical data separately from transaction-oriented databases for the purpose of decision support. Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several source.

• Data organization in data warehouses is based on areas of interest, on the major subjects of the organization: Customers, products, activities etc. databases organize data based on enterprise applications resulted from its functions.

• The main objective of a data warehouse is to support the decision-making system, focusing on the subjects of the organization. The objective of a database is to support the operational system and information is organized on applications and processes.

• A data warehouse usually stores many months or years of data to support historical analysis. The data in a data warehouse is typically loaded through an extraction, transformation and loading (ETL) process from multiple data sources.

Databases and data warehouses are related but not the same.

• A database is a way to record and access information from a single source. A database is often handling real-time data to support day-to-day business processes like transaction processing.

• A data warehouse is a way to store historical information from multiple sources to allow you to analyse and report on related data (e.g., your sales transaction data, mobile app data and CRM data). Unlike a database, the information isn't updated in real-time and is better for data analysis of broader trends.

• Modern data warehouses are moving toward an Extract, Load, Transformation (ELT) architecture in which all or most data transformation is performed on the database that hosts the data warehouse.

• Goals of data warehousing:

1. To help reporting as well as analysis.

2. Maintain the organization's historical information.

3. Be the foundation for decision making.

"How are organizations using the information from data warehouses ?"

• Most of the organizations makes use of this information for taking business decision like :

a) Increasing customer focus: It is possible by performing analysis of customer buying.

b) Repositioning products and managing product portfolios by comparing the performance of last year sales.

c) Analysing operations and looking for sources of profit.

d) Managing customer relationships, making environmental corrections and managing the cost of corporate assets.

Characteristics of Data Warehouse

1. Subject oriented Data are organized based on how the users refer to them. A data warehouse can be used to analyse a particular subject area. For example, "sales" can be a particular subject.

2. Integrated: All inconsistencies regarding naming convention and value representations are removed. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.

3. Non-volatile: Data are stored in read-only format and do not change over time. Typical activities such as deletes, inserts and changes that are performed in an operational application environment are completely non-existent in a DW environment.

4. Time variant : Data are not current but normally time series. Historical information is kept in a data warehouse. For example, one can retrieve files from 3 months, 6 months, 12 months or even previous data from a data warehouse.

Key characteristics of a Data Warehouse

1. Data is structured for simplicity of access and high-speed query performance.

2. End users are time-sensitive and desire speed-of-thought response times.

3. Large amounts of historical data are used.

4. Queries often retrieve large amounts of data, perhaps many thousands of rows.

5. Both predefined and ad hoc queries are common.

6. The data load involves multiple sources and transformations.

Multitier Architecture of Data Warehouse

• Data warehouse architecture is a data storage framework's design of an organization. A data warehouse architecture takes information from raw sets of data and stores it in a structured and easily digestible format.

• Data warehouse system is constructed in three ways. These approaches are classified the number of tiers in the architecture.

a) Single-tier architecture.

b) Two-tier architecture.

c) Three-tier architecture (Multi-tier architecture).

• Single tier warehouse architecture focuses on creating a compact data set and minimizing the amount of data stored. While it is useful for removing redundancies. It is not effective for organizations with large data needs and multiple streams.

• Two-tier warehouse structures separate the resources physically available from the warehouse itself. This is most commonly used in small organizations where a server is used as a data mart. While it is more effective at storing and sorting data. Two-tier is not scalable and it supports a minimal number of end-users.

Three tier (Multi-tier) architecture:

• Three tier architecture creates a more structured flow for data from raw sets to actionable insights. It is the most widely used architecture for data warehouse systems.

• Fig. 1.11.1 shows three tier architecture. Three tier architecture sometimes called multi-tier architecture.

• The bottom tier is the database of the warehouse, where the cleansed and transformed data is loaded. The bottom tier is a warehouse database server.

• The middle tier is the application layer giving an abstracted view of the database. It arranges the data to make it more suitable for analysis. This is done with an OLAP server, implemented using the ROLAP or MOLAP model.

• OLAPS can interact with both relational databases and multidimensional databases, which lets them collect data better based on broader parameters.

• The top tier is the front-end of an organization's overall business intelligence suite. The top-tier is where the user accesses and interacts with data via queries, data visualizations and data analytics tools.

• The top tier represents the front-end client layer. The client level which includes the tools and Application Programming Interface (API) used for high-level data analysis, inquiring and reporting. User can use reporting tools, query, analysis or data mining tools.

Needs of Data Warehouse

1) Business user: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.

2) Store historical data: Data warehouse is required to store the time variable data from the past. This input is made to be used for various purposes.

3) Make strategic decisions: Some strategies may be depending upon the data in the data warehouse. So, data warehouse contributes to making strategic decisions.

4) For data consistency and quality Bringing the data from different sources at a commonplace, the user can effectively undertake to bring the uniformity and consistency in data.

5) High response time: Data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse

a) Understand business trends and make better forecasting decisions.

b) Data warehouses are designed to perform well enormous amounts of data.

c) The structure of data warehouses is more accessible for end-users to navigate, understand and query.

d) Queries that would be complex in many normalized databases could be easier to build and maintain in data warehouses.

e) Data warehousing is an efficient method to manage demand for lots of information from lots of users.

f) Data warehousing provide the capabilities to analyze a large amount of historical data.

Difference between ODS and Data Warehouse

Metadata

• Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. In data warehousing, metadata is one of the essential aspects.

• We can define metadata as follows:

a) Metadata is the road-map to a data warehouse.

b) Metadata in a data warehouse defines the warehouse objects.

c) Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.

• In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data, the source of extracted data.

Why is metadata necessary in a data warehouse ?

a) First, it acts as the glue that links all parts of the data warehouses.

b) Next, it provides information about the contents and structures to the developers.

c) Finally, it opens the doors to the end-users and makes the contents recognizable in their terms.

• Fig. 1.11.2 shows warehouse metadata.


Foundation of Data Science: Unit I: Introduction : Tag: : Characteristics, Multitier Architecture, Needs, Benefits, Metadata - Data Warehousing