Advanced Databases and Applications

Contents

Keywords

Abstract

Introduction

Literature Review

Clustering Algorithm Techniques

Partitional Clustering

Hierarchical Clustering

Density Based Clustering (DBSCAN)

Neural Networks

Fuzzy Clustering

Grid clustering

Conclusion

Recommendations for future research, agreement and disagreement

References

Keywords

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), Density-Based Clustering (DBSCAN), Distributed Density-Based Clustering (DDC), Ordering Points To Identify the Clustering Structure (OPTICS), Clustering Using Representatives (CURE), and Fuzzy C-Means (FCM)

Abstract on Clustering Techniques in Data Mining

Clustering is a method in which a given collection of data is split into sets, known as clusters, in such a manner that similar data objects lie together in one cluster. Clustering plays a vital role in data mining because the data sets involved are very large. This study describes the main clustering techniques available for data mining and analyses them with reference to algorithms such as CURE, CLARANS, K-Means, CLARA, and DBSCAN.

Introduction to Clustering Techniques in Data Mining

Clustering is defined as the “unsupervised classification” of data sets or observations: the data have not been divided into clusters beforehand, so they have no class attribute attached to them. Clustering is widely used as one of the vital steps in exploratory data analysis. Clustering algorithms help to find useful, previously unidentified patterns and classes. Clustering divides a data set into groups of similar objects; objects that are not similar are placed in different clusters. A data object may belong to one cluster or to more than one cluster, depending on the selected metric. For instance, think about a retail database that holds information about the products bought by customers: clustering helps to group the customers according to their purchasing patterns. When objects are combined into clusters, simplification is achieved, and at this point some information is lost. To recover this lost information, different clustering methods are used, and this is the main focus of this study. Selecting the suitable clustering algorithm to apply is a critical step. The primary focus of this paper is to offer an explicit evaluation of the distinct clustering algorithms used in data mining, based on common criteria and on their complexity. The paper contains three sections. The first section is a literature review. The second section covers the different clustering techniques in data mining and compares them in terms of time and space complexity. The third section provides a table of clustering algorithms with their functions and applications, and is followed by a conclusion.
This review is important because it helps to address the issues that arise when working with large, continuous data sets. The paper discusses different families of clustering, namely partitioning-based, hierarchical, density-based, model-based, and grid-based clustering, to address these issues. The sources used in this review include electronic libraries, reliable information available on the internet, periodicals, journals, magazines, newsletters, and reports from the collaborating sector.

Literature Review of Clustering Techniques in Data Mining

According to Dalal et al. (2011), clustering techniques can be divided into three primary groups: partitioning clustering, density-based clustering, and hierarchical clustering. Hierarchical clustering is further divided into various parts, as shown in figure 1. According to Liu et al. (2015), if the data is large and continuous by nature, conventional mining methods are not suitable, because real-time data changes very quickly and needs an instant response. Handling large, continuous data sets was therefore a problem for earlier approaches. Random access to a data stream is very expensive, so only a single pass over the streaming data is available, and the storage required is also very large. Hence, streaming data requires both data mining and clustering approaches. Fahad et al. (2016) described how the effectiveness of a candidate clustering algorithm can be measured using various internal and external validity metrics, scalability tests, run time, and stability. Big data has its own weaknesses: it requires a large volume of storage, and this volume makes analytical, processing, and retrieval operations very large and very time consuming. To address these challenges, big data should be clustered into a compact format that remains an informative version of the whole data. Other clustering techniques, such as BIRCH, OptiGrid, and DENCLUE, are available for dealing with large data sets.

No clustering algorithm currently satisfies all the evaluation criteria, and future work is dedicated to addressing these challenges so that each clustering algorithm can handle big data. According to Douglass, document clustering has not been well received because of some constraints. Clustering techniques are relatively slow when dealing with large document collections, and in addition it is thought that clustering does not help to improve retrieval. Douglass defined a document browsing technique that uses fast clustering algorithms as its main operations. The result is a highly interactive browsing technique called the scatter/gather method. The technique is practical because the underlying clustering is fast. It can be understood through the example of a book. To find a specific term in a book, a reader can look it up directly in the index; to answer a more general question, the reader goes to the table of contents instead. The proposed scatter/gather system uses cluster-based navigation to move between documents. It can group one or more similar documents for future reference. Initially, the system scatters the document collection into a number of groups; the user then selects one or more groups, which are gathered into a sub-collection. The technique therefore offers two facilities: clustering and re-clustering. According to Singh et al. (2013), this method requires the Fractionation and Buckshot algorithms.

Distributed clustering works on two kinds of architecture. In homogeneous data sets, all local sites have the same attributes. In heterogeneous data sets, the local sites have distinct attributes, but the link between clusters relies on a common attribute. According to Zhao et al. (2012), fast, high-quality algorithms are necessary to browse large volumes of data. Here, one can use agglomerative clustering and partitional clustering, both of which can be used to build hierarchical clusterings. Experimental evaluation showed that partitional algorithms are much better than agglomerative algorithms, both because their computational requirements are much lower and because of their clustering performance. Guha et al. (2018) defined a new class of clustering algorithms, named constrained agglomerative algorithms, which combines the agglomerative and partitional approaches. That study proposes a new clustering algorithm that is very robust and is not limited to clusters that are spherical in shape. This hierarchical clustering algorithm provides a middle ground between the all-points and centroid-based methods. Guha et al. (2018) also offer a comparison between the CURE and BIRCH algorithms. To deal with large databases, the algorithm uses random sampling and partitioning. Clustering algorithms are used in various applications such as image segmentation, speech recognition, character recognition, vector quantization, and many more. These applications are part of machine learning, so clustering algorithms play an important role in machine learning. The main challenges for a clustering algorithm are the random selection of the initial centroids and the capability to handle the constant arrival of information.
This study presents a classification of information held on distributed nodes. If conventional clustering algorithms are applied to a distributed database, all the data sets must be transmitted to a central site; however, this is impractical because of the huge data sets present at every local site, limited bandwidth, and privacy concerns. Distributed clustering is of two types. The first is hard clustering, also known as crisp clustering, in which every data object belongs to exactly one cluster, that is, the clusters are disjoint; examples are k-means clustering and PAM. The second is soft clustering, in which each data object can belong to every cluster to some degree; examples are particle swarm optimization (PSO) and neural networks. PSO is a population-based stochastic optimization technique modelled on swarms of particles. Various researchers have defined PSO algorithms; the technique is used to form clusters on the basis of features such as accuracy, efficiency, and error count (Guha et al, 2018).

Clustering Algorithm Techniques

Partitional Clustering

Partitional clustering represents each cluster with a prototype and uses an iterative control strategy to optimise the clustering. A partitioning algorithm splits the data into subsets called partitions, and every partition is a cluster. The clusters formed have the following characteristics:

  • Every cluster must contain at least one object.
  • Every object must belong to exactly one cluster; clusters must not overlap.

Partitional algorithms fall into two categories: medoid algorithms and centroid algorithms. In a medoid algorithm, every cluster is represented by the instance that lies closest to its centre of gravity. In a centroid algorithm, every cluster is represented by the centre of gravity itself. An example is the K-means technique, in which the data set is divided into k subsets such that all the points in a subset are close to that subset's centre of gravity. The efficiency and effectiveness of the algorithm depend on the objective function used to measure the distance between instances. The technique requires all the data to be available in advance.

Time Complexity: the time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations required to converge. When k and l are fixed in advance, the algorithm has linear time complexity in n.

Space Complexity: the space complexity is O(k+n); additional storage is needed to hold the data itself.
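
The centroid-based procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the `initial_centroids` parameter, and the sample points are assumptions made for the example. The nested loops show where the O(nkl) cost comes from: each of at most l iterations compares n points with k centroids.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, initial_centroids=None, max_iterations=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    if initial_centroids is None:
        # Random initialisation: the usual default, and a known weakness.
        initial_centroids = random.Random(seed).sample(points, k)
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iterations):  # at most l iterations
        # Assignment step: n points compared with k centroids.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no point changed cluster
            break
        labels = new_labels
        # Update step: move each centroid to the centre of gravity of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2, initial_centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The initial centroids are passed explicitly here so that the result is reproducible; leaving `initial_centroids` unset falls back to random selection, which can lead to different partitions on harder data.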

Hierarchical Clustering

A dendrogram is a tree of clusters, and the hierarchical technique constructs it according to a proximity measure. Every cluster node contains offspring (child) nodes, and nodes that share the same parent are known as sibling nodes. The hierarchical method has a useful property called quick termination. Examples of this type of clustering are BIRCH, CHAMELEON, and CURE. The technique can be further divided into the agglomerative method and the divisive method.

Agglomerative Method:

This method works in a bottom-up manner: every data instance starts in its own cluster, and the closest pairs of clusters are merged step by step until all instances fit into one cluster. The closeness of a pair of clusters can be defined as single-link or complete-link.

  • Single-link closeness: the closeness of two clusters is the similarity of their most similar pair of instances, one taken from each cluster. It handles elliptical clusters efficiently; however, it is very sensitive to errors and noise.
  • Complete-link closeness: the closeness of two clusters is the similarity of their most dissimilar pair of instances, one taken from each cluster. It is not well suited to non-convex clusters; however, it is not very sensitive to errors, noise, or outliers.

Divisive Method:

This method works in a top-down manner: the data is divided into progressively smaller clusters until every cluster contains exactly one data instance.

Space Complexity: O(n^2)

Time Complexity: O(n^2 log n), where n is the number of patterns.
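
The difference between single-link and complete-link closeness in the agglomerative method can be illustrated with a small Python sketch. The function names and the one-dimensional sample data are assumptions made for the example, and the naive search for the closest pair is chosen for readability, so it does not achieve the O(n^2 log n) bound quoted above.

```python
def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_distance(c1, c2, linkage):
    """Single-link uses the closest pair, complete-link the farthest pair."""
    pair_distances = [euclidean(p, q) for p in c1 for q in c2]
    return min(pair_distances) if linkage == "single" else max(pair_distances)


def agglomerative(points, target_clusters, linkage="single"):
    """Bottom-up merging: start with one cluster per point, merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical data: a chain of four evenly spaced points and a distant pair.
line_points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
single = agglomerative(line_points, 3, linkage="single")
complete = agglomerative(line_points, 3, linkage="complete")
print(single)    # the chain is merged into one long cluster
print(complete)  # the chain is split into two compact clusters
```

On this data, single-link keeps extending the chain because only the nearest pair matters, while complete-link prefers compact groups because the farthest pair decides the merge.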

Density Based Clustering (DBSCAN)

In this type of clustering, distinct distance metrics can be used, and the number of clusters is determined automatically by the algorithm. Data objects are separated from each other on the basis of region, connectivity, and boundary. In DBSCAN, every data point either belongs to a cluster or is categorised as noise. Data points are classified as core points, border points, and noise points.

Core points: points that lie in the interior of a cluster. A point is a core point if the number of data points in its neighbourhood reaches a given threshold.

Border points: points that are not core points but lie in the neighbourhood of a core point.

Noise points: points that are neither core points nor border points.

Time Complexity: O(m * t), where m is the number of points and t is the time required to find the points in a neighbourhood.

Space Complexity: O(m)
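
The three point types can be made concrete with a short Python sketch. It only classifies points and does not build the clusters themselves; the names `eps` (neighbourhood radius) and `min_pts` (density threshold), the convention that a point counts as its own neighbour, and the sample data are assumptions made for the example. The neighbourhood search is a plain scan over all points, so t is O(m) here.

```python
def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""

    def neighbours(i):
        # Indices of all points within distance eps of point i (including i itself).
        return [
            j
            for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2
        ]

    neighbourhoods = [neighbours(i) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighbourhoods) if len(nb) >= min_pts}

    kinds = []
    for i, nb in enumerate(neighbourhoods):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # not dense itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds


# Hypothetical data: a unit square, one point hanging off it, and one outlier.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
kinds = classify_points(sample, eps=1.0, min_pts=3)
print(kinds)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```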

Neural Networks

A neural network is a non-linear data modelling technique that simulates the working of the brain. These networks are used to identify relationships among patterns based on input and output values.

Fuzzy Clustering

Fuzzy clustering is a reasoning-based technique that associates patterns with clusters by means of membership functions. It produces overlapping clusters.
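
As an illustration of a membership function, the sketch below computes the degrees of membership of one point using the standard Fuzzy C-Means (FCM) formula, u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)), where d_i is the distance to centre i and m is the fuzzifier. Choosing FCM as the example, the function name, and the sample values are assumptions; the cluster centres are taken as given rather than learned.

```python
def fcm_memberships(point, centres, fuzzifier=2.0):
    """Degrees of membership of `point` in each cluster; the degrees sum to 1."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(point, centre)) ** 0.5
        for centre in centres
    ]
    # A point lying exactly on one or more centres belongs fully to those centres.
    zero = [d == 0 for d in distances]
    if any(zero):
        share = 1.0 / sum(zero)
        return [share if z else 0.0 for z in zero]
    exponent = 2.0 / (fuzzifier - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical example: a point at 1.0 between centres at 0.0 and 4.0.
memberships = fcm_memberships((1.0,), [(0.0,), (4.0,)])
print(memberships)  # roughly 0.9 for the near centre and 0.1 for the far one
```

Because the point receives a non-zero degree for both centres, the resulting clusters overlap, unlike the hard assignment used in K-means.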

Grid clustering

The grid-based technique is very useful in spatial data mining. It splits the space into a number of cells, and the clustering operations are then carried out on the quantised space.
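
The quantisation step can be sketched as follows. This is only the first stage of a grid-based method, under the assumption of square cells of a fixed size and a simple density threshold; the function names, `cell_size`, `min_count`, and the sample points are all hypothetical.

```python
import math
from collections import Counter


def grid_cell_counts(points, cell_size):
    """Map every point to the integer coordinates of its cell and count per cell."""
    return Counter(
        tuple(math.floor(coordinate / cell_size) for coordinate in point)
        for point in points
    )


def dense_cells(points, cell_size, min_count):
    """Cells holding at least `min_count` points; later steps work on cells only."""
    counts = grid_cell_counts(points, cell_size)
    return {cell for cell, count in counts.items() if count >= min_count}


# Hypothetical spatial data.
locations = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (3.5, 3.5), (3.9, 3.1), (7.2, 0.5)]
cells = dense_cells(locations, cell_size=1.0, min_count=2)
print(sorted(cells))  # [(0, 0), (3, 3)]
```

Once the points are reduced to cell counts, the cost of further processing depends on the number of cells rather than on the number of points, which is what makes the approach attractive for large spatial data sets.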

Clustering algorithm | Function | Application
--- | --- | ---
K-means and fuzzy c-means | Helps in extracting genetic patterns | Battle simulations
Density-based sequencing clustering | Helps in identifying climate patterns | Meteorological department
Hierarchical clustering | Helps in identifying trend changes in traffic | Traffic department
K-means clustering | Helps in identifying share and stock patterns | Stock market
Mean shift clustering | Recognition of features | Capturing images
EM learning | Supervising the condition of tools | Public data

Conclusion on Clustering Techniques in Data Mining

This paper discussed various kinds of clustering, namely partitional, density-based, hierarchical, and grid-based clustering, along with their time and space complexities and how they help in data mining. Partitional clustering generally represents clusters with a prototype. It is very useful for convex clusters of similar size, and the number of clusters must be specified in advance; the main weakness is having to predict the number of clusters beforehand. Hierarchical algorithms split the data set into different levels of partitioning, and this hierarchy of levels is known as a dendrogram. These techniques are very effective for data mining; however, constructing the dendrogram is very expensive for large data sets. Density-based clustering is another algorithm that is very useful in data mining for large data sets: it can easily identify noise and can handle clusters of arbitrary shape.

Recommendations for Future Research, Agreement and Disagreement

Vector data is very useful in current data mining, but future data will be stored in much more complex forms, and data mining will have to cope with the ever-increasing volume of data. Another aspect is “patterns in data”, whose importance will also increase as the field evolves. A further aspect is usability: the ability to detect patterns that are understandable must improve, and data mining methods will need to become user friendly. Future data mining has to deal with complicated input and pre-processing, and users may have to set a greater number of parameters. Thus, achieving user friendliness with clear or even minimal parameterization is a primary aim. Usability can also be improved by finding new kinds of patterns that are simple to interpret even when the input is very complicated.

Even though no one can predict the future, it is believed that ample challenges and issues lie ahead, and a few of them cannot be foreseen at present.

References for Clustering Techniques in Data Mining

Dalal, M. A. & Harale, N. D. (2011). A survey on Clustering in data mining. International Conference and Workshop on Emerging Trends in Technology, TCET, Mumbai, India. 559-562.

Guha, S., Rastogi, R. & Shim, K. (2008). CURE: An efficient Clustering Algorithm for Large Databases. 73-84.

Zhao, Y. & Karypis, G. (2012). Evaluation of Hierarchical Clustering Algorithms for Document Datasets.

Singh, D. & Gosain, A. (2013). A Comparative Analysis of Distributed Clustering Algorithms: A survey. 2013 International Symposium on Computational and Business Intelligence. 165-169.

Liu, Q., Jin, W., Wu, S. & Zhou, Y. (2015). Clustering Research using dynamic modelling based on granular computing. 539-543.

Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Zomaya, A., Khalil, I., Foufou, S. & Bouras, A. (2016). A survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis. 267-279.
