AI algorithms: unsupervised

An overview diagram of the AI algorithms can be viewed on a separate web page: https://www.leerschool.be/othercontent/aialgosen.php

Choose the appropriate algorithm depending on the desired goal or output

Depending on the learning goal, you as a developer choose either a supervised or an unsupervised learning method. For unsupervised learning, you then choose the desired goal:

Unsupervised learning

  1. Clustering: Clustering aims to group similar data points together based on their features or proximity in the dataset. It identifies natural clusters or groups within the data without any prior knowledge of the class labels. The goal is to discover inherent structures or groupings in the data.

  2. Dimensionality reduction: Dimensionality reduction techniques focus on reducing the number of dimensions or variables in the dataset while preserving the most important information. This is beneficial when dealing with high-dimensional data. The goal is to simplify the dataset by transforming it into a lower-dimensional representation, which can aid in visualization, noise reduction, and more efficient processing.

  3. Pattern discovery (including association): Pattern discovery involves uncovering meaningful relationships, associations, or recurring patterns in the data. This can include identifying frequent itemsets or item combinations in transactional data, sequential patterns in time series data, or cluster patterns in spatial data. The goal is to gain insights into the underlying patterns and dependencies present in the data.

  4. Anomaly discovery: Anomaly discovery, also known as anomaly detection, focuses on identifying rare or unusual instances in the dataset that deviate significantly from the expected behavior or the majority of the data. The goal is to detect outliers or abnormal patterns that may indicate potential errors, fraud, or other irregularities in the data.

In summary, clustering groups similar data points together, dimensionality reduction reduces the number of variables, pattern discovery reveals meaningful relationships or recurring patterns, and anomaly discovery identifies unusual or abnormal instances in the dataset. These techniques serve different purposes, but all contribute to understanding and extracting insights from unlabeled data.

Do the exercise: https://www.leerschool.be/quiz/unsupervisedalgosen/

1. Clustering algorithms

Clustering algorithms, such as K-means clustering, hierarchical clustering, DBSCAN, and GMM (Gaussian Mixture Models), are used to group data points based on their similarity or proximity. The goal is to discover the intrinsic structure and natural groupings in the data.

K-means  

The previously mentioned K-means algorithm is an example of an unsupervised learning method. It divides data into a user-defined number of clusters (k = 3, k = 4, ...). So you decide how many groups you want to divide the input data into. The algorithm is useful for datasets that contain clearly distinguishable groups of points.

  1. K-means is a method of dividing points into groups based on their distance from each other. Imagine you have a lot of points on a map and you want to divide them into different groups.
  2. K-means works by first choosing a number of groups, let's say three. It starts by randomly placing three points, called "centroids," on the map. These points represent the centers of the groups.
  3. Then all other points are assigned to the closest center. So if a point is closer to the first center than to the other two, it is assigned to the first group.
  4. Then the centers are recalculated based on the average position of the points in each group. This means that the centers are moved to the average of the points in their group.
  5. This process is repeated until the centers stop changing or a certain number of repetitions is reached. Eventually you get groups of points that are close to each other and far from the other groups.

K-means helps us find patterns in the data, and it can be used for many things, such as grouping customers based on their purchase behavior or identifying different types of plants based on their characteristics.
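
To make the steps above concrete, here is a minimal sketch using scikit-learn's KMeans. The synthetic points and the choice of k = 3 are illustrative assumptions, not part of the lesson:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 synthetic 2-D points that form three visible groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Steps 2-5 from the list above: place k centroids, assign each point
# to the nearest centroid, recompute the centroids, and repeat until
# they stop moving.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # group number (0, 1, or 2) for the first points
print(kmeans.cluster_centers_)  # final positions of the three centroids
```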

Hierarchical Clustering

Imagine you have a bunch of different animals, like lions, tigers, bears, and elephants. Now, you want to group these animals based on how similar they are to each other. One way to do this is through hierarchical clustering. Hierarchical clustering is like making a family tree, but instead of people, we're doing it with animals. We start by looking at each animal as its own separate group, like a big family with only one member. Then, we compare the animals to see how similar they are to each other. 

  1. Let's say we compare the lions and tigers first. We look at their characteristics, like their size, color, and the kind of sounds they make. If lions and tigers are very similar, we can put them in a group together.
  2. Next, we compare the bears to the lions and tigers group. If bears are more similar to lions and tigers than they are to elephants, we put bears in the same group as lions and tigers.
  3. We keep comparing and grouping the animals until we have a big family tree. At the top of the tree, we have the biggest group that includes all the animals. Then, as we move down the tree, we have smaller and smaller groups that are more similar to each other.

The great thing about hierarchical clustering is that we can see how the animals are related to each other. We can see which animals are more similar and which ones are more different. It helps us understand how things are organized and grouped together based on their similarities.

So, in simple terms, hierarchical clustering is like making a family tree for animals based on how similar they are to each other. It helps us see how animals are related and grouped together.
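
As a small sketch of this family-tree idea, the snippet below clusters the four animals with SciPy's hierarchical clustering. The size and weight numbers are made up purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy feature vectors (height in m, weight in kg) for the four animals.
animals = ["lion", "tiger", "bear", "elephant"]
features = np.array([
    [1.2, 190],   # lion
    [1.1, 220],   # tiger
    [1.5, 450],   # bear
    [3.0, 5000],  # elephant
])

# Build the "family tree": repeatedly merge the two closest groups.
tree = linkage(features, method="average")

# Cut the tree into two groups and see who ends up together:
# lion, tiger, and bear form one group; the elephant stands alone.
groups = fcluster(tree, t=2, criterion="maxclust")
for name, g in zip(animals, groups):
    print(name, "-> group", g)
```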

DBSCAN

DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise," which means it is a method of grouping points in a space based on their proximity to each other. Imagine you have a lot of points, say on a map, and you want to know which points belong together.

DBSCAN works by grouping points based on how close they are to each other. The idea is that if points are close together, they probably belong to the same group. It determines the groups based on two important things: the distance between the points and the number of points that are nearby.

Imagine you have a group of friends who all live close together. They can easily get to each other's houses by walking a short distance. In this case, you can say that they belong to the same group. But if you have another friend who lives farther away and cannot easily get to the others, he probably does not belong to the same group.

DBSCAN works in a similar way. It sets a certain distance, and if two points are within that distance of each other, they are considered to belong together. Then it looks at other points that are within that distance of the group and adds them as well. This process repeats until no more points can be added to any group.

DBSCAN is useful because it can also identify points that do not belong to a group, but are more isolated. These points are considered "noise." Imagine you have a few lonely houses that are far away from other houses. They don't really belong to a group, so they are considered noise points.

In this way, DBSCAN helps us find clusters or groups of points based on their proximity. It is used in several areas, such as analyzing data, identifying patterns in images and even finding groups in social networks.
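
Here is a minimal sketch of DBSCAN with scikit-learn, assuming a handful of made-up map coordinates. The eps parameter plays the role of the "walking distance," and min_samples is how many neighbours a point needs within that distance to start or extend a group:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A "map" of points: two dense neighbourhoods plus two isolated houses.
points = np.array([
    [1.0, 1.0], [1.2, 1.1], [0.9, 1.3],   # first neighbourhood
    [8.0, 8.0], [8.1, 7.9], [7.8, 8.2],   # second neighbourhood
    [4.0, 15.0], [15.0, 2.0],             # isolated houses
])

# Points within eps = 1.0 of each other (with at least 2 neighbours)
# are merged into the same group.
db = DBSCAN(eps=1.0, min_samples=2)
labels = db.fit_predict(points)

# Points labelled -1 did not fit in any group: they are the noise points.
print(labels)   # e.g. two groups (0 and 1) plus two -1 noise points
```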

2. Dimensionality reduction

Dimensionality reduction is a technique used to make things simpler by reducing the number of features or aspects we consider. Imagine you have a lot of information about something, like a picture with many details. Sometimes, all those details can make it hard to understand or analyze the big picture. So, dimensionality reduction helps by finding the most important aspects and getting rid of the less important ones.

It's like looking at a picture and deciding to focus only on the main objects, ignoring the small details in the background. This way, we can understand the picture better and it becomes easier to work with.

Dimensionality reduction can be done in different ways, but the goal is always to keep the most valuable information while removing the less useful or redundant parts.

This technique is useful in many fields. For example, in data analysis, it can help us understand complex data by focusing on the most important factors. It also helps in visualizations, making it easier to show information in a simpler and more understandable way.

An example:

Dimensionality reduction in unsupervised learning helps us simplify large amounts of information. Imagine we have many images of different animals. Instead of using all the details of each image, we can identify important features like the shape of the snout, the ears, or the type of fur. By keeping these important features and discarding the rest, we can simplify the images. This makes it easier to compare and group the animals based on their similarities. So instead of looking at all the details of the images, we focus only on the key features to solve the problem more easily.
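
A common technique for this is Principal Component Analysis (PCA). Below is a minimal sketch with scikit-learn, where random numbers stand in for the animal-image features described above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the animal images: 100 samples with 50 features each.
# (Random data here; in practice these would be pixel or shape features.)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# Keep only the 2 directions that capture the most variation,
# i.e. the "key features" the text talks about.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (100, 50) -> (100, 2)
print(pca.explained_variance_ratio_)    # share of the variation kept
```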

3. Pattern discovery

Imagine you have a collection of your favorite toys scattered around your room. Now, you decide to organize them based on similarities. You start grouping the toys together based on their types, colors, or sizes. For example, you put all the stuffed animals in one pile, all the cars in another, and all the action figures in another. By doing this, you are discovering patterns or similarities among your toys. You notice that certain toys have similar characteristics and belong to the same group. This process of grouping toys based on similarities is similar to what pattern discovery does in unsupervised learning.

In the context of computers and data, pattern discovery involves finding similar patterns or relationships within a large dataset. The goal is to identify items, objects, or data points that share common features or behaviors. For example, in a grocery store's sales data, pattern discovery might reveal that customers who buy bread often also buy milk or that certain items are frequently purchased together, such as chips and soda.

Pattern discovery helps us understand how different things or events might be related. It allows us to uncover hidden structures or dependencies in the data that might not be immediately obvious. By identifying these patterns, we can gain insights, make predictions, and make better decisions.

So, in essence, pattern discovery is like finding hidden groups or connections within a dataset, just like organizing your toys based on similarities. It helps us discover interesting and useful information from the data we have.
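
To illustrate, here is a tiny hand-rolled sketch that counts which item pairs are frequently bought together. The baskets are made up, and real systems would use dedicated association-rule algorithms such as Apriori:

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, echoing the grocery example above.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"chips", "soda"},
    {"bread", "milk", "chips", "soda"},
    {"chips", "soda", "candy"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs that occur in at least 2 baskets count as "frequent" patterns,
# e.g. (bread, milk) and (chips, soda).
for pair, count in pair_counts.most_common():
    if count >= 2:
        print(pair, "bought together in", count, "baskets")
```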

The difference between clustering and pattern discovery:

  1. Goal and Approach:

    • Pattern discovery: The goal of pattern discovery is to uncover meaningful relationships, associations, or recurring patterns in the data. It focuses on finding similarities, dependencies, or interesting connections among the data points. Pattern discovery algorithms explore the entire dataset to identify patterns and relationships.
    • Clustering (including K-means): The goal of clustering is to group similar data points together based on their features or proximity. It aims to partition the data into distinct clusters, where points within the same cluster are more similar to each other than to points in other clusters. Clustering algorithms, like K-means, assign data points to clusters based on distances or similarities.
  2. Unsupervised Learning Techniques:

    • Pattern discovery: Pattern discovery involves various techniques, such as association rule mining, sequential pattern mining, or cluster pattern mining. These techniques search for interesting relationships, sequential orders, or patterns that occur frequently or have strong associations in the data.
    • Clustering (including K-means): Clustering techniques, like K-means, specifically focus on partitioning the data into groups or clusters. They calculate the similarity or dissimilarity between data points and assign them to clusters, aiming to maximize intra-cluster similarity and minimize inter-cluster similarity.
  3. Output:

    • Pattern discovery: The output of pattern discovery algorithms is typically a set of discovered patterns, relationships, or associations. These patterns may be represented as rules, sequences, or other forms that describe the discovered relationships in the data.
    • Clustering (including K-means): The output of clustering algorithms is a division of the data into clusters. Each cluster represents a group of similar data points. The output can be the cluster assignments for each data point or the cluster centroids.

In summary, while pattern discovery is focused on finding relationships, associations, or recurring patterns in the data, clustering (including K-means) aims to group similar data points together based on their features or proximity. Pattern discovery can be seen as a broader concept that encompasses various techniques, including clustering, for discovering interesting patterns in data.

4. Anomaly discovery

Imagine you and your friends always take the same route to school every day. You all know the typical traffic, the usual landmarks, and the time it takes to get to school. However, one day, you notice something strange during your commute. There's a huge traffic jam on the road that is usually clear. This unusual event catches your attention because it's not what you usually expect.

In unsupervised learning, anomaly discovery is like identifying those unusual events or "outliers" in a dataset. It involves finding data points or instances that are significantly different from the majority or unexpected based on the patterns observed in the rest of the data.

For example, let's say you have a dataset of daily temperatures in your city over the past year. Most of the time, the temperatures range from 15 to 30 degrees Celsius, but one day you see a temperature of 40 degrees Celsius. This exceptionally high temperature stands out from the rest of the data points and can be considered an anomaly.

Anomaly discovery algorithms analyze the dataset and look for data points that deviate significantly from the norm. They do this by examining the characteristics, patterns, or statistical properties of the data. These algorithms aim to identify unusual occurrences, unexpected behaviors, errors, or outliers that may be of interest or concern.

By detecting anomalies, we can identify potential problems, outliers, or interesting events that may require further investigation. Just like noticing the unusual traffic jam on your usual route, anomaly discovery helps us find those exceptional or unexpected occurrences in the data that may be important to understand or address.

So, in essence, anomaly discovery in unsupervised learning is like finding the unusual, unexpected, or different instances in a dataset, just like noticing something out of the ordinary during your daily routine. It helps us identify outliers or anomalies that may indicate something special or require our attention.
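
As a minimal sketch of this idea, the snippet below applies a simple z-score rule to the temperature example above. The threshold of 2 standard deviations is an illustrative assumption, and many other detection methods exist:

```python
import numpy as np

# Daily temperatures in °C, mostly between 15 and 30, with one odd reading.
temps = np.array([18, 22, 25, 19, 27, 30, 16, 21, 40, 24, 28, 17])

# Flag anything more than 2 standard deviations from the mean as an anomaly.
mean, std = temps.mean(), temps.std()
z_scores = (temps - mean) / std

for t, z in zip(temps, z_scores):
    if abs(z) > 2:
        print(f"Anomaly: {t} °C (z-score {z:.1f})")   # flags the 40 °C reading
```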