Our current and future economy is data-driven, with increasingly connected environments, more flexible means of production and automated processes. In this context, the ability to identify anomalies quickly and reliably is an important competitive advantage.
In this article, we explain how the application of artificial intelligence (AI) and unsupervised learning in anomaly detection can help companies and industries in various sectors.
What do we call an “anomaly”?
The Collins English Dictionary defines anomaly as ‘something different from what is usual or expected.’ The essence of an anomaly is therefore something that does not turn out as expected based on known information.
We can better understand what an anomaly is by considering it in the context of specific applications. For example, an anomaly in a fiscal process could be an indication of fraud. Or in terms of industrial production, finding anomalies could indicate when it is necessary to carry out maintenance work on machinery, thereby reducing unnecessary maintenance costs.
The vast and ever-increasing amount of data generated today allows us to use artificial intelligence and unsupervised learning algorithms to analyse it and find the patterns against which anomalies can be detected.
Unsupervised learning. What is it and why is it so interesting?
When talking about machine learning algorithms, there are different approaches that can be taken depending on the raw data. In some cases, our dataset may have previously defined labels, i.e. the data will include defined target variables that allow us to train our models.
Expressing this mathematically: we would know the input variables (x) and the output variable (Y), so we would have the necessary information to learn the function that relates the two, where Y = f(x). With this starting information, we can train our model to predict the output (Y) when we have new input data (x). This is known as supervised learning.
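As a minimal sketch of the supervised case, the snippet below learns a straight-line approximation of f from labelled pairs (x, Y) and then predicts Y for a new x. The data, the linear form of f and the names `fit_line` and `predict` are assumptions made purely for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of Y = a * x + b from labelled examples."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


# Hypothetical labelled data: both the inputs (x) and the outputs (Y) are known.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(xs, ys)


def predict(x):
    """Apply the learned function to a new, unseen input."""
    return a * x + b


print(predict(6.0))
```

The fit is only possible because every x comes with its Y. Without those labels there is nothing to fit against.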
So what is the problem? In the real world, most of the time, the data we deal with does not have predefined labels. That is, we do not know the ‘Y’ of our function, only the input variables ‘x’. The machine learning model therefore has to analyse the data, group it and find, by itself, some structure that can be used to assess new data. This is what unsupervised learning is all about.
Unsupervised machine learning is one of the main branches of machine learning and it has a multitude of applications. One of the most important is anomaly detection: identifying normal patterns within a data sample and then detecting outliers based on the natural characteristics of the data set itself.
Types of problems where unsupervised learning is applied.
Unsupervised learning applications are commonly divided into two main strands: association and clustering.
Association: The aim is to discover and learn representative rules within the dataset; for example, that customers who buy product A also tend to buy product B.
Clustering: The aim is to learn inherent groupings in the data, such as differentiating customer segments based on their buying behaviour.
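The association example (customers who buy product A also tend to buy product B) can be made concrete with two standard measures, support and confidence. The sketch below is illustrative only: the baskets are invented and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Invented shopping baskets, used only to illustrate the rule "A -> B".
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B"},
]

print(support(baskets, {"A", "B"}))       # how often A and B appear together
print(confidence(baskets, {"A"}, {"B"}))  # how often B appears given A
```

A rule is usually considered interesting only when both values are high enough. The thresholds are a choice made by the analyst, not something the data dictates.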
Unsupervised learning algorithms can be applied in many fields and are especially useful for anomaly detection. One of the most common algorithms is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a very useful algorithm for identifying noise in data. It works by identifying core points: data points that have at least a minimum number of neighbours within a defined radius. Core points and the points within reach of them form clusters, and any data point that falls outside every such ‘neighbourhood’ is labelled as noise.
The main advantage of this algorithm is that it determines the number of clusters itself, rather than needing it to be specified in advance, and can find clusters of many different shapes and sizes. This makes it a very useful algorithm for working with noisy data and outliers.
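The logic of DBSCAN can be sketched in a few lines of plain Python. This is a simplified, unoptimised illustration rather than a production implementation. The parameter names `eps` (the radius) and `min_pts` (the minimum neighbourhood size, counting the point itself), the sample points and the use of -1 as the noise label are all assumptions chosen for this example.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1                     # points[i] is a core point: new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is also core: keep expanding
    return labels


# Two dense groups plus one isolated reading (invented data).
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 0.5),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is never passed in. It emerges from the density of the data, and the isolated point is labelled as noise. In practice a library implementation, such as the `DBSCAN` class in scikit-learn, would normally be preferred over a hand-written version.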
Another typical algorithm for anomaly detection is Isolation Forest. The logic it follows is different from other known methods and revolves around the idea that abnormal points within datasets are easier to isolate than normal points. To achieve this, the algorithm partitions the dataset by randomly selecting an attribute, then taking a random value of that attribute and splitting the sample into two parts: the observations above that value and those below it. These operations are repeated until all observations are isolated. Points that end up isolated after only a few splits are the most likely anomalies.
The main distinguishing feature of this algorithm is that it requires less processing power than many other methods, which makes it particularly suitable for large datasets, including high-dimensional ones.
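The isolation idea can be illustrated with a simplified sketch. It counts how many random splits are needed to isolate each point and averages that count over many repetitions. It is not the full Isolation Forest algorithm: it draws fresh random splits for every point instead of building and reusing trees, and it skips subsampling and score normalisation. The function names, data and parameter values are assumptions for illustration.

```python
import random


def isolation_depth(point, sample, rng, depth=0, max_depth=12):
    """Number of random splits made before `point` is alone in its partition."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))              # randomly select an attribute
    lo = min(row[dim] for row in sample)
    hi = max(row[dim] for row in sample)
    if lo == hi:                                 # simplification: stop if no split is possible
        return depth
    split = rng.uniform(lo, hi)                  # random value of that attribute
    # Keep only the side of the split that contains `point`.
    if point[dim] < split:
        side = [row for row in sample if row[dim] < split]
    else:
        side = [row for row in sample if row[dim] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Eight similar readings and one that is far from the rest (invented data).
data = [
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.3), (1.1, 1.0), (1.3, 1.2),
    (0.8, 0.8), (1.05, 1.15), (0.95, 0.95), (8.0, 9.0),
]

depths = {p: average_depth(p, data) for p in data}
outlier = min(depths, key=depths.get)
print(outlier)  # the point that is isolated with the fewest splits on average
```

The distant point is separated from the rest almost immediately, so its average depth is much lower than that of the others. A library implementation, such as the `IsolationForest` class in scikit-learn, turns this depth into a normalised anomaly score.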
Applications of unsupervised learning and anomaly detection.
Anomaly detection in industry
Industry is one of the sectors where this type of algorithm is most widely applied. One of the most common uses is in quality control, where these algorithms can help to reprogram industrial computers for the production of new items. They can also help optimise supply chains by detecting where anomalous downtime occurs, and they have many other uses, such as analysing customer purchasing behaviour and managing inventory.
Anomaly detection in other sectors
Healthcare is one of the sectors where artificial intelligence is being most widely applied. Many diagnostic and patient monitoring tasks can be supported by anomaly detection, such as detecting erroneous treatment plans from radiotherapy data series. It also has an interesting use in epidemiology, where it can detect the emergence of pathogen mutations based on patients’ responses to treatments.
In the financial sector, one of the main uses of anomaly detection is to detect fraud in electronic payments. It is also useful for assessing creditworthiness when granting loans and for predicting bankruptcies, and it has various applications in optimising stock market investments.
Centum: experts in Smart Factory.
At Centum we develop projects and offer Industry 4.0 (Connected Industry) solutions, optimising processes through the use of Big Data and Artificial Intelligence algorithms. If you want more information about our services, please contact us.