CIS8025 Big Data Management Technologies University of Southern Queensland- Australia
Task 1 Big data management technologies
Task 1.1 Hadoop
Hadoop is a software framework used for managing and analysing big data in both structured and unstructured form. Companies such as Facebook, Yahoo, Twitter and LinkedIn use Hadoop for their data management. This open source tool can run across multiple servers. Its capabilities for managing big data have improved over time, and it can now analyse data from more sources (Ghorbanian, 2019).
The core components of Hadoop are the Hadoop Distributed File System (HDFS), Hadoop Common, YARN, MapReduce, Hadoop Submarine and Hadoop Ozone.
HDFS is the component that distributes data across the machines of a cluster for storage. Each machine that stores data is known as a data node, and HDFS manages the data nodes to store data efficiently (Gorelik, 2019).
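The way a file is spread over data nodes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `place_blocks`, the node names and the round-robin assignment are hypothetical, and the 128 MB block size is the figure quoted elsewhere in this report. Real HDFS also replicates blocks and applies its own placement policy, both of which the sketch ignores.

```python
def place_blocks(file_size_mb, data_nodes, block_size_mb=128):
    """Split a file into blocks and assign them to data nodes in turn."""
    placement = {node: [] for node in data_nodes}
    block_count = -(-file_size_mb // block_size_mb)  # ceiling division
    for block_id in range(block_count):
        # Round-robin assignment (an assumption made for this sketch).
        node = data_nodes[block_id % len(data_nodes)]
        placement[node].append(block_id)
    return placement

# A hypothetical 500 MB file stored on three hypothetical data nodes.
placement = place_blocks(500, ["node-1", "node-2", "node-3"])
print(placement)
```

The 500 MB file needs four blocks, so one of the three nodes holds two blocks and the others hold one each.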
Applications run concurrently on the Hadoop framework. The YARN component is responsible for managing these applications. It ensures that resources are allocated appropriately to each running application, and it also manages the applications through scheduling.
The MapReduce component of Hadoop is responsible for running batch applications. It manages their parallel execution.
Hadoop Common acts as a shared resource that the other components in the framework rely on.
Hadoop Ozone provides the technology for an object store.
Hadoop Submarine supports machine learning.
Hadoop is also used with other tools that form part of the Hadoop stack, such as database management systems, Apache HBase, and data management and application tools.
The basic feature of Hadoop is that it collects data from multiple sources. The framework runs on many nodes, which enables it to accommodate large volumes of data.
Hadoop supports growing data volumes because its tools run on clusters of machines. The storage units scale horizontally: more nodes are added to accommodate more data.
Hadoop in Big Data
The Hadoop framework supports both structured and unstructured data. It accepts and manages data in raw form. Data exists all around an organization, and Hadoop efficiently collects it from various sources and processes it.
Hadoop is an open source tool for processing large pools of data, so organizations have to pay only for Hadoop add-ons. The Apache Software Foundation controls Hadoop.
Additional software tools for Hadoop are offered as bundles. Hadoop gives organizations the flexibility to collect, store, process and analyse large volumes of data.
The MapReduce component performs data analysis efficiently because it is compatible with other tools. It supports various data analysis tools as well as programming languages such as Java, Ruby and Python.
Organizations that generate huge volumes of data need Hadoop or similar platforms. The following organizations use the Hadoop framework:
Facebook organizes its data across multiple Hadoop components. Facebook statuses are stored in MySQL, and the Facebook Messenger app runs on HBase (Ghorbanian, 2019).
Amazon uses Hadoop for data processing. Its Elastic MapReduce web service is used for tasks such as log analysis, data warehousing, indexing, bioinformatics, machine learning, financial data analysis and scientific simulations.
eBay uses Hadoop components such as Java MapReduce, Apache HBase, Apache Pig and Apache Hive for tasks such as search optimization and research.
Adobe uses components such as Apache Hadoop and Apache HBase. Its production deployment runs Hadoop components on clusters of 30 nodes.
Shortcomings of Hadoop
The Hadoop framework has the following limitations:
- Slow processing
- Batch processing
- Issues related with small files
HDFS is designed to support large volumes of data, and it handles small files poorly. A small file is one that is smaller than the HDFS block size (128 MB by default). The file system works best with a small number of large files rather than a large number of small files. To handle many small files, the files can be merged and then copied into HDFS.
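The benefit of merging small files can be illustrated with a rough calculation. The sketch below assumes the 128 MB block size quoted above and simply counts how many blocks the file system has to track; the function name `count_blocks` and the file sizes are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS block size quoted in the text

def count_blocks(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the number of blocks needed to store the given files.

    Each file needs at least one block, and a block never holds data
    from more than one file.
    """
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

# 1,000 files of 1 MB each versus the same data merged into one 1,000 MB file.
small_files = [1] * 1000
merged_file = [1000]

blocks_small = count_blocks(small_files)   # one block per small file
blocks_merged = count_blocks(merged_file)  # ceil(1000 / 128) blocks
print(blocks_small, blocks_merged)
```

The same amount of data needs 1,000 blocks as separate small files but only 8 blocks once merged, which is why merging reduces the bookkeeping load.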
The small file issue can also be handled with Hadoop Archives (HAR), which build a layered file system on top of HDFS. HAR files are created with the hadoop archive command, which runs a MapReduce job to pack the files into the archive. Reading through a HAR file is less efficient than reading directly from HDFS (Liu, 2020).
The MapReduce component processes huge volumes of data in two phases, Map and Reduce. These tasks take considerable time, which increases latency. The issue can be overcome with technologies such as Apache Spark and Apache Flink. Spark uses in-memory processing, which reduces the time spent moving data in and out of disk, so it is faster than MapReduce. Flink follows a streaming architecture, which can be faster than Spark.
Task 1.2 MapReduce
MapReduce is the Hadoop component for processing huge volumes of data. With this framework, applications can process large volumes of data in parallel on large clusters of computers in a reliable manner. The framework is designed for distributed computing and is written in Java. It works in two phases, Map and Reduce: the Map task handles splitting and mapping, and the Reduce task handles shuffling and reducing (Prabhu, 2019).
The Hadoop framework runs MapReduce programs written in a variety of languages such as Ruby, Python, Java and C++. The programs run in parallel on multiple machines in a cluster, which is useful for large-scale data analysis.
MapReduce divides a large volume of data into smaller parts and processes those parts on multiple servers in a cluster. The outputs gathered from those servers are then processed again to deliver the final output. The framework is scalable and works across a large number of computers. In this way MapReduce converts the input data into output data.
The framework sends the computation to where the data resides. It performs tasks in three stages: Map, Shuffle and Reduce.
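The three stages can be sketched with a small word count written in plain Python. This is a single-machine illustration of the idea, not Hadoop code: the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` and the sample lines are hypothetical, and a real job would run the stages on many nodes in parallel.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for one key."""
    return key, sum(values)

def word_count(lines):
    """Run the three stages in sequence over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data", "big clusters process big data"])
print(counts)
```

Each stage only needs the output of the previous one, which is what allows the map and reduce work to be spread across separate machines.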
A MapReduce program executes the Mapper and the Reducer across the data set in two layers. A MapReduce job is the work that the client needs to perform. The job consists of the input data, the MapReduce program and a configuration file, and it is performed on the data using the program and the configuration.
Task in Progress (TIP)
A task in a MapReduce program that is being executed by either the Mapper or the Reducer is known as a Task in Progress (TIP).
A particular instance of an attempt to execute a task on a node is known as a task attempt. Any machine may fail at any time. If a node goes down, the MapReduce framework reassigns the task to another node. There is an upper limit on this reassignment: by default, once a task has failed 4 times it is not rescheduled again and the job is treated as failed.
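The reassignment of a failed task can be sketched as a simple retry loop. This is a minimal illustration, assuming the limit of 4 attempts mentioned above; the names `run_task` and `flaky_task`, the node names and the choice to raise an error once the limit is reached are assumptions of the sketch, not Hadoop's actual scheduler logic.

```python
MAX_ATTEMPTS = 4  # limit quoted in the text

def run_task(task, nodes, max_attempts=MAX_ATTEMPTS):
    """Try a task on successive nodes until it succeeds or attempts run out.

    `task` is any callable taking a node name; raising RuntimeError
    counts as a failed attempt. Returns (result, attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        node = nodes[(attempt - 1) % len(nodes)]
        try:
            return task(node), attempt
        except RuntimeError as error:
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

def flaky_task(node):
    # Hypothetical task: only "node-3" is healthy.
    if node != "node-3":
        raise RuntimeError(f"{node} is down")
    return f"done on {node}"

result, attempts = run_task(flaky_task, ["node-1", "node-2", "node-3"])
print(result, attempts)
```

Here the first two attempts fail and the third succeeds on the healthy node. If only failing nodes were available, the loop would stop after the fourth attempt.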
The first phase in a MapReduce program is Mapping. The key element of Mapping is the key/value pair. The framework transforms the incoming data, whether structured or unstructured, into key/value pairs. The key is a reference to the input value, and the value is the actual data set to be processed.
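The transformation of raw input into key/value pairs can be sketched as follows. The sketch assumes one common convention for text input, in which the key is the byte offset at which a line starts and the value is the line itself; the function name `to_records` and the sample text are hypothetical.

```python
def to_records(text):
    """Turn raw text into (offset, line) key/value pairs.

    The key is the byte offset at which the line starts and the value
    is the line itself, without its trailing newline.
    """
    records = []
    offset = 0
    for raw_line in text.splitlines(keepends=True):
        records.append((offset, raw_line.rstrip("\n")))
        offset += len(raw_line.encode("utf-8"))
    return records

records = to_records("hadoop stores data\nmapreduce processes data\n")
print(records)
```

The key here only identifies where the value came from; the mapper's logic works on the value.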
Processing of Map
The user defines a function, based on the requirements, to process the input values. As a result, output is generated and stored on the local disk.
The Reduce phase takes the intermediate key/value pairs as input and processes the output of the mapper. The reducer performs summation or aggregation tasks. Its input is produced by the mapper, and the key/value pairs are sorted by key.
Processing of Reduce
The final output is produced by the reduce process and is stored in HDFS.
The task of the mapper is to process the input data. The input data exists as files or directories stored in HDFS. The input file is passed to the mapper function line by line. The mapper processes the data and produces multiple small chunks of data.
The Reduce stage includes both the shuffle and the reduce steps. The data that comes from the mapper is fed into the reducer for processing. The reducer produces a new set of output, which is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to servers in the cluster. The framework handles details such as issuing tasks to servers, verifying task completion and copying data between nodes. Servers perform the data processing locally, which reduces network traffic. After the tasks are completed, the output is gathered from the nodes to form the final result.
Challenges of MapReduce in big data management
MapReduce performs tasks in two phases, Map and Reduce, which takes considerable time.
The Hadoop framework supports only batch processing and does not support streaming data. The MapReduce framework does not make full use of the memory of the Hadoop cluster to support streaming data. Apache Spark addresses the problem of streaming data and handles both batch and stream processing, although for streaming it is generally less efficient than Flink because it processes streams in micro-batches.
MapReduce operations in Hadoop are slow because they must support multiple formats and structures as well as huge amounts of data. In the Map phase, MapReduce converts the input data into another form, key/value pairs. The Reduce phase takes the output of the Map phase and processes it further. Switching between the phases takes time, which increases latency. The issue can be resolved using Apache Spark or Apache Flink.
MapReduce requires developers to write code for each and every operation, which makes it difficult to work with. It also has no iterative mode. Tools such as Pig and Hive reduce the difficulty of working with MapReduce.
Task 2 Big data Visualization
Big data visualization approaches convert data into pictorial representations for easy interpretation and decision making. Any type of data can be converted into pictorial form. The techniques help decision makers explore datasets and identify unexpected patterns and correlations in them (Olson, 2019).
One of the important features of big data visualization techniques is scalability. Big data includes amounts of data that would take humans years to read, but data visualization techniques convert this data into graphical representations that the human brain grasps easily.
Organizations generate huge volumes of data every year. Their problem is how to put the generated data to good use. Data visualization techniques support organizational decision makers in analysing the data, and they offer the following benefits:
- Reviewing huge amounts of data
A graphical representation summarizes a huge amount of data, which enables decision makers to understand the data more efficiently and quickly than with traditional spreadsheets or tables.
- Spotting trends
Visualizing time series data exposes trends in the data, which enables business organizations to capture business trends.
- Identification of correlations as well as unknown relationships
A powerful feature of big data visualization is data exploration, which helps in finding unexpected or unknown knowledge in datasets.
- Presentation of data to others
Visual presentations are used in business operations. They convey the meaning of the data quickly and efficiently.
Big data visualization examples
Lists of items, and items sorted by a single feature
Contour maps, proportional symbol maps, dot distribution maps, cartograms
Connected scatter plots, timelines, time series charts, circumplex charts
Tag clouds, pie charts, histograms, bar charts, heat maps, tree maps, spider charts
Radial tree charts, dendrograms, hyperbolic tree charts
Based on the area of application, visualization approaches are classified into the following categories:
Information visualization approaches
- Multidimensional techniques are useful for visualizing data from a single data set.
- They can also be applied to data from two or more data sets.
- However, for visualizing data from more than one data set they are less efficient than hierarchical visualization techniques.
Hierarchical visualization techniques
These techniques are used to analyse data from more than one data set. Examples of hierarchical visualizations are as follows:
This technique illustrates the data sets in hierarchical clusters, which are used to understand the relationships between datasets.
- Tree models
It depicts the tree-like data structure of a data set, drawn either from left to right or upside down.
- These techniques are useful for depicting the relationships between variables within a dataset.
- Hierarchical relationships among multiple datasets can be captured efficiently through a network model.
Network visualization exhibits the relationships between various datasets. Examples of network visualization techniques are as follows:
This diagram is a type of flow diagram that represents how a data structure changes over time or under some condition.
onal parameter than pie chart. Each sector is evaluated by its distance from the center, its arc length and its angle. Sectors that stretch farther from the center are more important than sectors that stretch less.
Time series analysis is used for the evaluation of continuous data. For example, CPU usage on a particular day can be visualized through time series analysis.
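A simple way to prepare such a series for plotting is to smooth it with a moving average, so that the underlying trend stands out from short-term fluctuations. The sketch below is only an illustration: the function name `moving_average`, the window size and the CPU usage readings are hypothetical.

```python
def moving_average(values, window):
    """Return the simple moving average over a sliding window."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical hourly CPU usage readings (percent) for part of one day.
cpu_usage = [20, 35, 30, 50, 45, 70, 65, 90]
trend = moving_average(cpu_usage, window=4)
print(trend)
```

The raw readings rise and fall from hour to hour, while the smoothed values increase steadily, which makes the upward trend easier to see on a chart.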
Challenges for Big data visualization
Scalability and dynamics are the major challenges of big data visualization. The diversity and heterogeneity of big data also pose problems for visualization. Big data requires massive parallelization to be handled. The challenges of high dimensionality and high complexity can be addressed through dimensionality reduction. However, visualizations with more dimensions are useful for extracting unknown knowledge from the dataset.
In big data, the objects are often too closely connected to one another. On screen they become too small or are lost altogether, which hinders the extraction of useful information from the dataset. This issue, known as visual noise, can be reduced with pre-processing techniques.
Large image perception
Visual noise can be resolved with a simple technique such as using a larger image. However, large image perception sometimes creates issues of its own. Humans cannot perceive heavily overloaded displays and do not have the ability to process high volumes of data. Data filtering methods are applied to resolve the issue.
Loss of information
Visual noise and large image perception lead to the next issue, loss of information. Data filtering and aggregation methods may cause important data or objects to go unnoticed. Aggregation methods also consume more resources and time.
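How aggregation can hide an important object is easy to show with a small example. The sketch below replaces each group of consecutive readings with its mean; the function name `aggregate`, the bucket size and the readings are hypothetical.

```python
def aggregate(values, bucket_size):
    """Replace each bucket of consecutive values with its mean."""
    buckets = [values[i:i + bucket_size]
               for i in range(0, len(values), bucket_size)]
    return [sum(bucket) / len(bucket) for bucket in buckets]

# Hypothetical readings containing one outlier (95).
readings = [10, 12, 11, 95, 9, 10, 11, 12]
summary = aggregate(readings, bucket_size=4)
print(max(readings), max(summary))
```

After aggregation the largest value shown is 32.0 instead of 95, so a viewer of the summarized chart would not notice the outlier.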
Requirement of high performance
Big data visualization includes both static and dynamic visualization. Using static visualization techniques on a dynamic data set creates performance issues. Dynamic visualization needs more computing resources and time to process the data.
Visualizing every point of big data leads to overplotting and overwhelms the user's perceptual and cognitive abilities. Interactive scalability and perceptual scalability are therefore important issues for big data visualization.
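One common way to avoid plotting every point is to count the points that fall into each cell of a grid and draw the counts, for example as a heat map. The sketch below shows only the counting step; the function name `bin_points`, the cell size and the sample points are hypothetical.

```python
from collections import Counter

def bin_points(points, cell_size):
    """Count how many (x, y) points fall into each square grid cell."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

# Hypothetical points; a real dataset would hold millions of them.
points = [(0.5, 0.5), (1.2, 0.3), (1.8, 1.9), (8.1, 9.7), (9.9, 9.9)]
cells = bin_points(points, cell_size=2)
print(dict(cells))
```

The number of items to draw now depends on the number of grid cells rather than on the number of data points, which keeps the display readable as the dataset grows.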
Most current big data visualization tools perform poorly in terms of functionality, scalability and response time.
Some of the methods for resolving the big data visualization challenges are as follows:
- Tag cloud
- History flow
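As an illustration of the first method, a tag cloud draws each word at a size that reflects how often it occurs. The sketch below computes such sizes with linear scaling; the function name `tag_sizes`, the font size range and the sample text are hypothetical, and the text is assumed to be non-empty.

```python
from collections import Counter

def tag_sizes(text, min_size=10, max_size=40):
    """Map each word to a font size that grows linearly with its frequency."""
    counts = Counter(text.lower().split())
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid division by zero when all counts match
    return {word: min_size + (count - low) * (max_size - min_size) / span
            for word, count in counts.items()}

sizes = tag_sizes("data hadoop data spark data hadoop")
print(sizes)
```

The most frequent word receives the largest font size and the least frequent word the smallest, so the dominant terms of a large text collection are visible at a glance.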
Tools used to overcome the issues are:
- MS BI
The data visualization issues of big data are related not only to the existing technologies but also to the human ability to take in the data. The development of tools should focus on both factors.
Saket, B., Endert, A. and Rhyne, T.M., 2019. Demonstrational interaction for data visualization. IEEE computer graphics and applications, 39(3), pp.67-72.
Olson, D.L. and Lauhoff, G., 2019. Descriptive data mining. In Descriptive Data Mining (pp. 129-130). Springer, Singapore.
Prabhu, S., Dechant, T., Koleszar, L., Stephens, C. and Colling, J., 2019. Big Data Enhancement to the Integrated Environmental Quality Sensing System (No. DOE-IWT-18596). Innovative Wireless Technologies, Inc.
Ghorbanian, M., Dolatabadi, S.H. and Siano, P., 2019. Big data issues in smart grids: A survey. IEEE Systems Journal, 13(4), pp.4158-4168.
Liu, P., Loudcher, S., Darmont, J., Perrin, E., Girard, J.P. and Rousset, M.O., 2020, July. Metadata model for an archeological data lake. In Digital Humanities (DH 2020).
Gorelik, A., 2019. The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science. O'Reilly Media.