CIS8008 -Data Warehouse Vs Transactional Database
Task 1.1 Data warehouse, Data warehouse Vs transactional Database
Data warehouse is one of the information system which stores historical as well as commutative data either from single or more than one source. The main objective of data warehouse is for analysing, reporting, integrating transaction data from multiple sources.
Difference: Data warehouseVs Transactional Database
- Data warehouse use O-On L-line A-Analytical P-Processing -OLAP and Database follows O-On L-line T-Transactional P-Processing –OLTP
Data warehouse optimize with OLAP for handling less number of complex queries with the aggregation of more historical data sets. Data mining techniques uses OLAP applications. OLAP is used to analyse data (Inmon et al 200). Simple data warehouse has tables as well as joins that are de-normalized. The functionality of OLAP is query management system. The data are not modified frequently. It does not have integrity issue. It takes minutes to responds.
Databases are optimized for maximizing the speed as well as efficiency on data base transaction such as add, modify, and delete and analysis. Databases follows OLTP to insert, update, delete more numbers of online transactions, The main use of OLTP is to provide fast query processing, data integrity maintenance in multi access environment. The performance is measured by the number of transaction completed per second. OLTP database is current real world data which is detailed. Entity model is used to store transactional database. It takes milli second to respond.
- Data structure
Transactional data base follows normalization. It involves reorganization of data that it does not include redundant data. Tables store related data. Databases consume less disk spaces due to normalization. It needs more complex queries to get data from database.
Data warehouse follows de-normalization technique. It has fewer tables for groping data. It has redundancy. It increases performance for performing OLAP. It has data those are subject based.
Task 1.2 Three Main types of Data warehouse
Three main types of data warehouse are as follows
- E-Enterprise D-Data W-Warehouse –EDW
The warehouse is centralized to offer decision support across enterprise. Organization and representation of data follows unified approach. It has the ability for classification of data according to subject (Surajit Chaudhuri et al 1997). It gives overview of object in data model through wrangling and identifying data from multiple sources. The information is stored in consistent as well as conformed model. After collecting data from multiple sources, it acts as single location to provide access where various tools can be applied on data base to perform analytical processing as well as predictions.
- O-Operational D-Data S-Store-ODS
ODS is also centralised database that offers snapshot of latest data from various transactional system. It has latest up to date which follows current status of data. It does not have historical data. The data of ODS refreshed due to the OLTP source of data. The data of ODS is volatile. It provides decision for operational management. It has detailed data. The granularity is not same as source OLTP system.
- D-Data M-Mart-DM
It is subset of data warehouse. It is particularly designed for specific type of business such as finance, sales and so on. it is independent which gathers data from sources directly. It is used to enhance user response as well as reduce data volume for analysis. The implementation of data mart is easy and cost effective than data warehouse. It gives chances for change. Specific area expert can design its structure as well as configuration. The data can be partitioned and granularity can be controlled. The types of Data marts are dependent, independent and hybrid.
Task 1.3 Data Lake, Data Lake Vs Data warehouse
Data lake and data warehouse both are used to store big data. Data lake is pool of raw data it does not have specific purpose. It includes structured, unstructured as well as semi structured data. The data follows their native format with no limit for file size (Ayman Alserafi et al 2016).It enables organization to store data in cost effective way for further processing. Analyst mainly focuses to extract meaning knowledge from Data Lake not data.
Data warehouse is repository of structured data those are gathered and processed for specific purpose. Data warehouse use hierarchical structure in which folders and files arranged in hierarchical manner.
Data Lake follows flat architecture. Each data of Data Lake is tagged with unique identifier and Meta data.
Data Lake can handle big data challenges such as volume, variety as well as velocity. It is scalable. It supports structured, unstructured and semi structured data. It supports IoT. Data can be stored efficiently without any error due to its unstructured nature. HDFS based storage gives more flexibility to store large volume of data. It is scaled horizontally. It can acquire high velocity of data. It utilizes the tools like Kafka, Scribe, and Flume for acquiring high velocity data. It can acquire data from Whatspp, Twitter
Data warehouse has structured data. It has Scalabilityissue. Data warehouse tools are expensive to handle high volume of data.
Data ware house uses cubes as predetermined model to store data. Data lakes do not follow any predetermined model to store data (Jeffrey Dean et al 2008). It gives flexibility to store and perform analytics process on those data. Data warehouse follows schemas to store data. It takes time to organize data. Data Lake do not follow any scheme. It adopts Hadoop’s simplicity to store data seamlessly and schema oriented read models. Data Lake is useful for exploratory analysis. When need arises data can be stored in structured SQL and can be reused using PL SQL scripts. HAWA as well as IMPALA gives more flexibility.
Data Lake and data warehouse is varying in data structure. Data Lake has raw data whereas Data warehouse has processed data. Data Lake consumes more space due to unprocessed data structure. Raw data does not provide data quality as well as governance measures. Date warehouse includes processed data. It consumes less storage than Data Lake. It does not have data that never used. Processed data is easily understood for audience than raw data.
The data of Data Lake is not purpose specific. The gathered data will be utilized in future. It does not have any pre-processing technique to organize and store data.
Data warehouse has processed data. It data are specific need oriented. The storage space is not getting wasted due to utilization of data efficiently.
Data Lake requires specialised tools as well as analyst to analyse and interpret those unstructured as well as un-processed data. Data Lake is not well established prior. When data is in need then it will be transformed.
Data warehouse does not have much technical knowledge to analyse the data. The person with the topic knowledge in an organization can utilize those data. Those data can be interpreted with spread sheet, tables, and charts and so on.
The data warehouse is established well and it is proven solution. SQL server is widely and family used toolset for accessing data warehouse. The data warehouse maintenance can be performed internally with in organization. It delivers good performance due to its structured data. Query can easily process those data and return data.