Integrating

Big Data in Cloud Environment – A Review

Mr. Deepak Ahlawat1, Dr. Deepali Gupta2

PhD Research Scholar MMU Sadopur1, HOD

CSE MMU Sadopur2

[email protected],

[email protected]

Abstract:

In

this paper the concept of the Big Data and Cloud Computing are integrated and reviewed. Big

data term refers to huge volume of data in today’s internet environment, much

of which cannot be integrated easily. Cloud computing and big data go

hand in hand. Big data gives the users the ability to utilize massive computing

power to process the distributed queries in different datasets and return

outcome sets in a timely manner. Cloud computing is the underlying engine that

along with Hadoop, provides the platform for distributed data-processing. In

the later section, future work with the integration of big data and cloud

computing are presented.

Keywords: GA, PRF, CURE.

1 Introduction

1.1. Big Data

Big data 1 can be characterized by 4Vs: the extreme volume of data, the wide

variety of types of data, the velocity at which the data must be must processed

and the value

of the process of discovering huge hidden values from large datasets with

various types and rapid generation. . Big data term refers to huge volume of data in today’s

internet environment, much of which cannot be integrated easily.

Big data takes huge amount of time and

costs/money to get some useful analysis done on it. As knowledge can only be

drive from a careful analysis of data (Data Mining), thus several new

approaches to storing and analysing data have emerged. Instead, raw

data with

extended metadata is aggregated in a data

lake and machine

learning and

artificial intelligence (AI) programs use

complex algorithms to look for repeatable patterns 2. Collection of large amount of data takes place because

of the human involvement in the digital space. The work is being shared stored

and managed and lives online. As an example, approximately several terabytes of

data daily uploaded and viewed on Facebook.

Fig.1. Big Data Classification

This kind of huge data with useful information is

known as big data. Clustering is the capable data mining method using widely

for mining valuable information in the unlabeled data. From the last few

decades, numbers of clustering algorithms are developed on the basis of a

variety of theories plus applications.

1.2. Cloud Computing

A cloud is a

computing process in which services are dispersed above network by computing

processes 3. Service models consist of three main categories 4:

Software

Platform

Infrastructure

Fig.2.

Service Models

SaaS

(Software as a Service)

·

The

web access is given to commercial software.

·

From

a middle location, the software is managed.

·

One

–to-many is the way for delivering the software.

·

The

users don’t need to manage software improvements and patches.

·

Among

number of software’s, Application Programming Interfaces (APIs) allows the

integration.

PaaS (Platform as a Service)

·

To allow the services to expand, experiment, organize, host and

protect the application in the same integrated improved atmosphere and the

equivalent services desired to accomplish the application development

procedure.

·

The web build user interface formation tools assists to make,

adapt, test and organize dissimilar UI framework.

·

Multi-tenant plan that has numerous simultaneous users use the

similar growth application.

·

Constructed in scalability of deployed software counting load

balancing and failover.

·

Addition with the web services and databases of frequent

standards.

·

Sustain for growth team collaboration – some PaaS solutions

comprises of project planning and communication tools.

·

Tools to handle billing and subscription management.

IaaS (Infrastructure as a Service)

The

resources are dispersed as a service.

It

permits for effectual scaling.

It

has a patchy cost, usefulness pricing model.

Usually

it has a multiple user environment.

1.3. Relation of

Cloud Computing and Big Data

Cloud computing

and big data go hand in hand. Big data gives the users the ability to utilize

massive computing power to process the distributed queries in different

datasets and return outcome sets in a timely manner. Cloud computing is the

underlying engine that along with Hadoop, provides the platform for distributed

data-processing 5. The relation between cloud computing and big data is shown

in below figure. The large data sources from the cloud and Web are being stored

in a distributed fault-tolerant database and processed via the programming

model for huge datasets with parallel distributed algorithm within a cluster

6.

Fig.3. Relation of Cloud Computing and Big Data

1.4.

Clustering

in Big Data

Data clustering is known as a problem of a partition of unlabeled objects

sets that is O = {o1, o2,

. . . , on} in k groups of alike objects, in which 1 <
k < n. Before clusters could be required, it become necessary for
estimating k, this is the problem of cluster tendency 6. When each
object is shown by attributes vector, data clustering is developed on feature
vectors xi ?
Rp, in which xi is the p-dimensional
feature vector for oi, 1 ? i ? n. These data
could be shown in the form of an n × n dissimilarity
matrix D, having Dij representing dissimilarity (distance) among oi
plus oj. Basically,
the Euclidean distance ||xi ? xj|| is known as the dissimilarity measure, but it could be
some norm on Rp 7.
Following are some of the clustering algorithms:
1.4.1.
K-mean
clustering
The k-means
clustering algorithm is the fundamental algorithm which is dependent on the
partitioning method using for many clustering tasks mainly with low dimension
datasets. It utilizes k as a parameter, with the division of n objects in k
clusters for the objects in the similar cluster to behave similar to every, but
different to another objects in other clusters. The algorithm normally finds
the cluster centers, (C1 …… Ck), for minimizing the sum of the
squared distances of every data point, xi, 1 ? i ? n, to its
nearest cluster center Cj, 1 ? j ? k. Initially, the algorithm
arbitrarily selects the k objects, showing a cluster mean/center. Later, the object xi
in the data set is transferred to the adjacent cluster center i.e. to the
parallel center. The algorithm calculates the novel mean for every cluster and
re-assigns every object to the adjoining new center. This method iterates till
no amendments occur for the assigning the objects. The convergence outcome
minimizes the sum-of-squares error which is defined as the squared distances
sum from every object to its cluster center 7.
1.4.2.
Fuzzy K-mean
Fuzzy
K-Means is also known as Fuzzy C-Means Clustering, which is the extension of K-Means
technique 8. The K-Means algorithm only finds the clusters of regular
shapes, i.e., Hard Clusters, but Fuzzy K-mean is also suitable to find the Soft
Clusters 9.
The fuzzy k-means algorithm is described as
follows:
1.
To assume a
fixed number of clusters k.
To Randomly initialize the k-means connected with the clusters with the
computation of the probability that every data point is a member of a known cluster k,
2.
To recalculate the centroid of the
cluster as the weighted centroid mentioned the probabilities of
membership of all data points
3.
To iterate till convergence of
a user-specified number of iterations being reached.
1.4.3.
Clustering using Genetic Algorithm
GA (Genetic
algorithm) was proposed early in 1989 that attracts many attentions as it
perform a globalized investigation for solutions whereas another clustering
approaches execute a localized search and therefore, simply get stuck at local
optimality's. In a localized search, the novel obtain solution take over the
ones in the preceding iteration. Such example includes k-means, ANNs, fuzzy
clustering algorithms with tabu search, annealing schemes. However, in Genetic
Algorithm, the crossover and mutation operators could produce novel solutions
that are very dissimilar from the preceding iteration which is where the global
optimality basically comes 10. Also, Genetic algorithm works paralleling, making
it possible for implementing parallel hardware for speed up the execution.
In
fact, Genetic Algorithm is known as evolutionary approach, which applies
evolutionary operators and solutions population for achieving a partition of
global optimal. GA includes selection of functions, mutation operation, and a
fitness function. The candidate solutions to the clustering problem are being
encoded as chromosomes, and later a fitness function inversely proportional to
the squared error value is applied for determining the chromosomes existing
likelihood in the subsequent generation 11.
2 Related Works
Chen et al., (2017) presented Parallel Random Forest (PRF) algorithm for big data on the
Apache Spark platform. Researcher optimized the PRF algorithm on a hybrid
approach combining the data-parallel and task-parallel optimization. In
data-parallel optimization, vertical data-partitioning method is used and in
task-parallel optimization, a dual parallel approach is carried out. Both the
techniques sufficiently improve the efficiency. Moreover, by using the
dimension-reduction in the training process and weighted voting approach in the
process preceding parallelization improves the accuracy for algorithm for
large, noisy and high-dimensional data. Experimental results are superior to
previous results implemented by Spark MLlib and other studies 12.
Xu et al., (2017) had designed a speculative execution schemes for parallel processing
clusters. Researchers devised two schemes: one for lightly loaded systems and
other for heavily loaded systems. For light loaded systems, they proposed Smart
Cloning Algorithm (SCA) and for heavily loaded systems, Enhanced Speculative
Execution (ESE) Algorithm is proposed. The simulation result compares the SCA
and Microsoft Mantri, in SCA the total job flowtime is reduced by 6% in
comparison to Microsoft Mantri. In terms of the job flowtime, the ESE algorithm
outperforms the Microsoft Mantri baseline scheme by 71% 13.
Thingom et al., (2017) discusses the concept of the integration of big data and cloud
computing. Researchers pointed out the flexibility and minimum cost (pay &
use model) required in the cloud scenario 14.
El-Seoud et al., (2017) showed the trends and challenges faced in the field of big data and
cloud computing. Study reveals the risks plus benefits that may arise due to
the integration of big data and cloud computing. The study also unfolded the
concepts behind big data and cloud computing 15.
Bharill et al., (2016) has focused his paper on clustering large datasets in Apache Spark
environment. Authors designed and implement partitioned dependent clustering and
choose the specified environment because of its low computational needs. In
this research, Scalable Random Sampling with Iterative Optimization Fuzzy C-Means
algorithm (SRSIO-FCM) is implemented on an Apache Spark Cluster. The
experimental studies on different big datasets are conducted. The performance
of SRSIO-FCM is better in comparison to the Literal Fuzzy C-Means (LFCM). The
results are stated in terms of space and time complexity. According to the
results, SRSIO-FCM runs in less time without compromising the clustering quality
16.
Wei Shao et al., (2016) have presented a model for clustering data by means of
spatiotemporal-intervals, which is consider as a spatiotemporal data type
connected with a start- and an end-point. The model proposed by the researcher
could be used to evaluate the spatiotemporal interval data clusters. The work
has aimed to deal with the evaluation of clustering results in variety of
Euclidean spaces. This is dissimilar from the existing clustering that
calculates the outcome in space of single Euclidean. The existing clustering
algorithms are analyzed and compared with the use of energy function 17.
Sun et al., (2015) has done clustering with the use of time impact factor matrix. The
matrix monitors how user interest drifts and then predicts the rating of the
item. In addition to the time impact factor matrix, the author has added one
more time impact factor and use the linear regression for predicting the user
interest drift. The comparisons of the experiments have been conducted on three
big data sets, namely, MovieLens1M, MovieLens100K, and MovieLens10M. The
results have shown that the proposed approach has efficiently improved the
prediction accuracy 18.
Sookhak et al.,
(2015)
improves the storage capability of the cloud system by reducing the
communicational and computational overhead costs. Authors proposed an RDA
(Remote Data Auditing) technique which is dependent on algebraic signature properties.
The authors also design a novel data structure, DCT (Divide and Conquer Table)
which could effectively supports operations of dynamic data like insert,
append, delete and modify. The DCT data structure could be applied to the
storage of large-scale data and incurred less computational cost. The
comparison among the proposed method and other RDA techniques has shown that
the proposed method is efficient and secured, and hence reduces the computational
and communication costs on the server and the auditor 19.
Kumar et al.,
(2015) proposed the ClusiVAT algorithm. The proposed
algorithm is compared with the K-means, single pass K-means, online K-means and
CURE (Clustering using representatives).
The comparison results show that ClusiVAT is the fastest and accurate among all
five algorithms. For example, it has recovered 97% of the ground truth labels
in the real world KDD-99 cup data (4 292 637 samples in 41 dimensions) in 76 s
7.
Hashem et al.,
(2014) studied
how the vast amount of data (Big data) and cloud computing is a challenge in
today's computer world. The characteristics definition with the classification
of big data has been discussed on cloud computing. The relationship among big
data and cloud computing, Hadoop technology with the storage systems of big
data are discussed. Also, the investigation of research challenges with the
focus on availability, data transformation, data heterogeneity, governance and
privacy with legal regulatory issues is taken place 1.
Yin et al.,
(2014) have
focussed on the detection of faults with the isolation for the systems of
vehicle suspension. The system being proposed is classified into mainly three
steps, primarily to confirm the number of clusters dependent on PCA (Principal
component analysis and secondly to detect the
faults by using fuzzy
positivistic C-means clustering with the fault lines and next to isolate the
root causes for faults by using the technique of Fisher discriminant analysis.
Dissimilar from another scheme, the proposed method only requires measurements
of accelerometers which are fixed on four corners of a vehicle suspension.
Moreover, dissimilar spring attenuation coefficients are being regarded as a
special failure in place of few others 8.
Konak et al.,
(2006) studied
the emerging technology GA (Genetic Algorithm) for the existing problems. Author
addresses the multi-objective formulations which are considered as realistic
techniques for problems of more complex engineering optimization. For real-life
problems, the objectives under consideration conflicts with each other and the
optimization of particular solution for single objective that could result in
unacceptable results for other objectives 11.
3 Future Works
Beyond the basic execution needs, small additional services
like Machine learning, Analytics, and Orchestration are being accessible by the
cloud. There are numerous reasons for this move as summarized below 20:
i.
Clouds are the main providers for data services.
ii.
Machine Learning and other AI approaches will
surely improve the scenario and Orchestration (Automation) would make the
service provider capable to have the Service level agreement on time.
iii.
Analytics would accelerate the business and
Orchestration can be helpful when the acceleration takes place.
iv.
The future of Clouds would be the mixture of Analytics
and Orchestration.
v.
Big Data and Cloud Computing will surely automate
the maximum workload in the distributed computing environment.
References
1.
Hashem et al., "The rise of "big data" on
cloud computing: Review and open research issues", Information Systems 47 (2014): 98-115.
2.
Wu et al., "Data mining with big data", IEEE transactions on knowledge and data engineering 26.1 (2013): 97-107.
3.
Subashini et al., "A survey on security issues
in service delivery models of cloud computing", Journal of network and computer applications 34.1 (2011): 1-11.
4.
Pallis et al., "Cloud computing: the new
frontier of internet computing", IEEE
internet computing 14.5 (2010): 70-73.
5.
Talia Domenico, "Toward cloud-based big-data
analytics", IEEE Computer Science (2013): 98-101.
6.
Fernandez et al., "Big Data with Cloud
Computing: an insight on the computing environment, MapReduce, and programming
frameworks", Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery 4.5 (2014): 380-409.
7.
Kumar et al., "A hybrid approach to clustering
in big data", IEEE transactions on cybernetics 46.10 (2015): 2372-2385.
8.
Yin et al., "Performance monitoring for vehicle
suspension system via fuzzy positivistic C-means clustering based on
accelerometer measurements", IEEE/ASME
Transactions on Mechatronics 20.5 (2014):
2613-2620.
9.
Zahid et al., "Fuzzy clustering based on
K-nearest-neighbours rule", Fuzzy
Sets and Systems 120.2 (2001):
239-247.
10. Maulik et al., "Genetic algorithm-based clustering
technique", Pattern recognition 33.9 (2000): 1455-1465.
11. Konak et al., "Multi-objective optimization using genetic
algorithms: A tutorial", Reliability
Engineering & System Safety 91.9 (2006):
992-1007.
12. Chen et al., "A
Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing
Environment", IEEE Transactions on
Parallel and Distributed Systems 28.4 (2017): 909-933.
13. Xu et al.,
"Optimization for Speculative Execution in Big Data Processing Clusters", IEEE Transactions on Parallel and
Distributed Systems 28.2 (2017): 530-545.
14. Thingom et al.,
"An Integration of Big Data and Cloud Computing", Proceedings of the International Conference on Data Engineering and
Communication Technology (2017): 729-737.
15. El-Seoud et al.,
"Big Data and Cloud Computing: Trends and Challenges", International Journal of Interactive Mobile Technologies 11.2
(2017): 34-52.
16. Bharill et al., "Fuzzy Based Scalable Clustering Algorithms
for Handling Big Data Using Apache Spark", IEEE Transactions on Big Data 2.4 (2016): 339-352.
17. Wei Shao et al., "Clustering Big Spatiotemporal – Interval
Data", IEEE Transactions on Big Data 2.3 (2016): 190 – 203.
18. Sun et al., "Dynamic Model Adaptive to User Interest Drift
Based on Cluster and Nearest Neighbors", IEEE Access 14.8 (2015):
1682-1691.
19. Sookhak et al., "Dynamic remote data auditing for securing
big data storage in cloud computing", Information Sciences 380 (2015): 101-116.
20. Furht et al., "Handbook of cloud computing",
Vol. 3. New York: Springer, (2010).
21. Azar et al.,
"Dimensionality Reduction of Medical Big Data using Neural-Fuzzy Classifier", Soft Computing: Springer 19.4 (2015):
1115-1127.
22. Cao et al.,
"Cluster as a Service: A Resource Sharing Approach for Private Cloud", Tsinghua Science and Technology 21.6
(2016): 610-619.
23. Han et al., "Data
Mining: Concepts and Techniques", Elsevier
(2006).
24. Kurasova et al.,
"Strategies for Big Data Clustering", IEEE
26th International Conference on Tools with Artificial Intelligence (2014):
740-747.
25. Reshmy et al.,
"Data Mining of Unstructured Big Data in Cloud Computing", International Journal of Business Intelligence and Data Mining 12.3
(2017).
26. Zhao et al.,
"Independent Tasks Scheduling Based on Genetic Algorithm in Cloud Computing", WiCom '09- 5th International
Conference (2009).