Expert Insights | Earley Information Science

Cluster Analysis in Big Data Mining Explained - Without the Math

Written by Seth Earley | Sep 28, 2015 4:00:00 AM

So we’ve all heard the term “big data” by this point - lots and lots of data coming from many sources, some of them disparate, most of them unstructured, all of them containing valuable insights.  Several approaches have been developed or are in development to harness the implied power in this data, and one of them is known as “cluster analysis”.  As an information scientist I am particularly interested in how this clustering or grouping can reflect patterns and groupings, whether correlative or associative, and how they can help me make some sense out of so much unstructured data.  I am also interested in how taxonomies, ontologies, and other controlled vocabularies are used to help inform these processes and deepen the analysis.

High level definitions

While the techniques used to analyze data clusters are mathematical in nature and usually lie within the purview of computer science, I’ve tried to summarize them here without the math so we can get a high level understanding of how they operate and what they accomplish.

Supervised vs. unsupervised learning

Let’s begin with supervised versus unsupervised learning. 

Supervised learning means analysis with predetermined parameters: the categories and groupings are already defined for the system to use.

Unsupervised learning means analysis with no predetermined parameters: the system has to create its own categories and groupings as it progresses through the data.
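As a rough illustration of the difference, here is a minimal sketch in plain Python. It is not taken from any particular tool: the function names, the sample values, and the gap threshold are all hypothetical and chosen only to make the contrast visible.

```python
# Supervised: the categories are defined up front as labeled examples (hypothetical data).
labeled_examples = {"small": [1.0, 2.0, 3.0], "large": [10.0, 11.0, 12.0]}

def classify_supervised(value, examples):
    """Assign value to the predefined category whose average is closest."""
    def average(xs):
        return sum(xs) / len(xs)
    return min(examples, key=lambda label: abs(value - average(examples[label])))

# Unsupervised: no categories are given, so the groups come from the data itself.
def group_unsupervised(values, gap):
    """Sort the values and start a new group wherever the jump exceeds gap."""
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for v in ordered[1:]:
        if v - groups[-1][-1] > gap:
            groups.append([v])
        else:
            groups[-1].append(v)
    return groups

print(classify_supervised(2.5, labeled_examples))
print(group_unsupervised([1.0, 11.0, 2.0, 12.0, 3.0, 10.0], gap=4.0))
```

In the first function the labels "small" and "large" exist before any new value arrives. In the second, the two groups only appear once the data has been examined.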

Agglomerative or Divisive

Whether or not the parameters are predetermined, we need to decide whether to take an agglomerative or a divisive approach.

Agglomerative, or “bottom up”, creates clusters based on similarities or dissimilarities starting with individual points. 

Divisive, or “top down”, breaks apart existing clusters of data points, also based on similarities or dissimilarities, to establish more refined clusters. With any massive dataset there will inevitably be outliers or “noise”, that is, points far from the rest of the data that don’t fall within a specific range or curve, and these can skew the results.
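The bottom-up idea can be sketched in a few lines of Python. This is a simplified illustration only, assuming one-dimensional points and a "closest members" rule for deciding which clusters to merge. The function name and sample data are hypothetical.

```python
def agglomerate(points, target_clusters):
    """Bottom-up clustering of 1-D points: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # distance between clusters = distance between their closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerate([1, 2, 9, 10, 25], target_clusters=3))
```

With this data the point 25 ends up alone in its own cluster, which is one simple way an outlier can show up in the results.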

Algorithmic or probabilistic methods

Another consideration is whether to use algorithmic or probabilistic methods.

Algorithmic methods base the analysis of clusters on a specific algorithm, that is, a rule or set of rules applied the same way to every point.  Also known as “hard assignment”, these methods place each point in exactly one cluster and generally choose a starting point for each cluster, then refine it into a mean (centroid) or medoid.  The “K-means” type algorithm places each cluster’s center at the mean of its points, so outliers can pull the center away from the bulk of the cluster. The “K-medoid” type algorithm uses an actual data point as the center, which makes it less sensitive to outliers.
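To see why the choice of center matters when outliers are present, consider this small sketch. It only computes the two kinds of center for a single cluster and is not a full K-means or K-medoid implementation. The function names and the sample cluster are hypothetical.

```python
def mean_center(values):
    """Center used by K-means style methods: the arithmetic mean."""
    return sum(values) / len(values)

def medoid_center(values):
    """Center used by K-medoid style methods: the actual data point
    with the smallest total distance to all the other points."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [1, 2, 3, 4, 100]    # 100 is an outlier
print(mean_center(cluster))    # 22.0, pulled toward the outlier
print(medoid_center(cluster))  # 3, stays with the bulk of the data
```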

The second method, “probabilistic” clustering, also known as “soft assignment”, bases its analysis on probability: rather than placing each point in exactly one cluster, it estimates how likely each point, outliers included, is to belong to each cluster. These estimates are refined as the analysis proceeds, which incorporates an element of machine learning.
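Soft assignment can be pictured as giving every point a share of membership in every cluster. The sketch below assumes, purely for illustration, that each cluster is a bell curve with the same spread and equal weight. The function name, the centers, and the spread value are hypothetical.

```python
import math

def soft_assign(x, centers, spread=1.0):
    """Return the probability that x belongs to each cluster.

    Assumption: every cluster is a Gaussian with the same spread and equal weight.
    """
    weights = [math.exp(-((x - c) ** 2) / (2 * spread ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

print(soft_assign(0.0, centers=[0.0, 4.0]))  # almost entirely the first cluster
print(soft_assign(2.0, centers=[0.0, 4.0]))  # halfway between: split evenly
```

A hard assignment method would have to pick one cluster for the point at 2.0. The soft version reports that it sits equally between the two.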

Dimensionality

Another major issue with clustering big data is dimensionality.  Not only are the data points high in number, they are often very high in dimensionality.  Some points can have dimensions that number in the thousands or even tens of thousands, so graphing data dimensions and finding the similarities and dissimilarities to create associative relationships is critical.  One popular approach to handle such volume is to create micro clusters until macro clusters emerge.   These methods continue until all points are accounted for, and some work again in reverse.  Patterns emerge and can be represented visually or mathematically.

The data dimensions, also known as attributes, can be ordinal, nominal, or numerical.  

Ordinal dimensions express a ranking or order among data points, such as small, medium, and large.

Nominal dimensions are qualitative and descriptive, such as color and material. 

Numerical dimensions are measured quantities, such as size or weight, where the differences between values are meaningful as well as their order.

Distances between data points across these dimensions are computed using different mathematical measures, such as Manhattan or Euclidean distance, or similarity measures such as a Gaussian kernel. These computations show relative similarity and dissimilarity between points and help create clusters.  Distances can also be used to define “reachability” between points, a notion used by density-based clustering methods, which provides yet another clustering parameter.
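For readers who want to see the two most common distance types side by side, here is a minimal sketch. The function names and the two sample points are illustrative only.

```python
import math

def manhattan(a, b):
    """Sum of the absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p, q = (0, 0), (3, 4)
print(manhattan(p, q))  # 7
print(euclidean(p, q))  # 5.0
```

The same two points are 7 units apart by one measure and 5 by the other, so the choice of distance type can change which points count as "close".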

The higher the dimensionality of the data, the harder it is to cluster. This is known as the “curse of dimensionality”.  Partitioning and grid-based clustering are two methods that can help handle very high-dimensional data.  These methods look for subspaces within the high-dimensional space to increase efficiency and scalability.  In other words, they start with one dimension, then go up to two, three, and so on.
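The "start with 1D, then go up" idea can be sketched roughly as follows. This is a heavily simplified illustration of a grid-based subspace search, not an implementation of any named algorithm. The function name, the cell size, the density threshold, and the sample points are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def dense_cells(points, dims, cell_size, min_points):
    """Lay a grid over the chosen dimensions and return the cells
    that contain at least min_points points."""
    counts = Counter(tuple(int(p[d] // cell_size) for d in dims) for p in points)
    return {cell for cell, n in counts.items() if n >= min_points}

# Hypothetical 3-D points: dimensions 0 and 1 are tightly grouped, dimension 2 is spread out.
points = [(1.1, 1.2, 0.5), (1.3, 1.4, 3.5), (1.2, 1.1, 6.5), (1.4, 1.3, 9.5), (5.0, 8.0, 2.0)]

# Step 1: look at each dimension on its own and keep those with a dense cell.
promising = [d for d in range(3) if dense_cells(points, (d,), 1.0, 3)]
print(promising)

# Step 2: only combine the promising dimensions into 2-D subspaces.
for pair in combinations(promising, 2):
    print(pair, dense_cells(points, pair, 1.0, 3))
```

Because dimension 2 shows no dense region on its own, it is never combined with the others, which is where the savings in effort come from.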

Big data cluster analysis applications

Here are some pertinent applications of cluster analysis in big data: social media trends, buying trends, identifying terrorist cells, predicting system failures, predicting disease in patients, calculating risk for insurance companies, and even city planning.  With so much data being produced, the range of possible applications keeps growing.

In addition, applying these analyses in cognitive computing is paving the way for more sophisticated and intelligent agents by means of learning algorithms and predictive analytics.  Driven and informed by taxonomies, ontologies, and other controlled vocabularies, these systems are becoming increasingly able to ingest, analyze, theorize, and subsequently offer complex and dynamic solutions to many problems.  Below are some specific examples of where cluster analysis in data mining can help businesses achieve their goals.

Marketing

A Director of Customer Relationships has five managers working for him. He would like to organize all the company’s customers into five groups so that each group can be assigned to a different manager. Strategically, he would like the customers in each group to be as similar as possible. Additionally, two customers with very different business patterns should not be placed in the same group. His intention behind this strategy is to develop customer relationship campaigns that specifically target each group, based on the features its customers share.  By clustering customer data, he can create these groups and assign them accordingly.
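A scenario like this one could be approached with a K-means style grouping into five clusters. The sketch below is a bare-bones illustration under several assumptions: each customer is described by just two made-up numbers, the first five customers seed the five groups, and the function names are hypothetical. A real analysis would use many more attributes and a more careful choice of starting points.

```python
def squared_distance(a, b):
    """Squared straight-line distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def k_means(points, k, iterations=10):
    """Very small K-means for 2-D points; the first k points seed the centers."""
    centers = list(points[:k])
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each customer to the nearest current center
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # move each center to the mean of its group (keep the old center if the group is empty)
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

# Hypothetical customers as (orders per month, average order size).
customers = [(1, 1), (10, 1), (1, 10), (10, 10), (20, 20),
             (2, 1), (11, 1), (1, 11), (10, 11), (21, 20)]

for manager, group in enumerate(k_means(customers, k=5), start=1):
    print("Manager", manager, group)
```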

Fraud detection and credit analysis

In banking, many factors can strongly or weakly influence loan payment performance and customer credit rating. Data mining methods, such as attribute selection and attribute relevance ranking, can help identify the important factors and eliminate irrelevant ones for loan payment prediction and customer credit analysis.  Additionally, clustering methods can help pinpoint fraud by detecting outliers.
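As a simplified stand-in for outlier detection, the sketch below treats all of one customer's transactions as a single cluster and flags the amounts that sit unusually far from its center. It is not a clustering algorithm in itself. The function name, the threshold, and the transaction amounts are hypothetical.

```python
import statistics

def flag_outliers(amounts, threshold=3.0):
    """Flag amounts whose distance from the median is more than
    threshold times the typical (median) distance from the median."""
    center = statistics.median(amounts)
    typical = statistics.median([abs(a - center) for a in amounts])
    # if the typical distance is zero, nothing is flagged
    return [a for a in amounts if typical and abs(a - center) / typical > threshold]

transactions = [20, 22, 19, 21, 23, 20, 950]
print(flag_outliers(transactions))
```

The median is used here rather than the mean so that the suspicious amount does not distort the very center it is being compared against.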

Improve customer service

The telecommunication industry has quickly evolved from offering local and long-distance telephone services to providing many other comprehensive communication services. The integration of telecommunication, computer network, Internet, and numerous other means of communication and computing has been changing the face of telecommunications and computing. Data mining can help understand business dynamics, identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve service quality.  Additionally, clustering methods can help make sense of customer feedback through behavior and sentiment analysis.
