Today I am reposting and sharing a post from the excellent blog The Morning Paper. It concerns the application of recent big data machine learning techniques to IT security systems. Also of note is the growing importance of Artificial Intelligence in technological developments broadly, especially when combined with proper human interaction.
Will machines take over? The lesson of today’s paper is that we’re better off together. Combining AI with HI (human intelligence; I felt like we deserved an acronym of our own) yields much better results than a system that uses only unsupervised learning. The context is information security: scanning millions of log entries per day to detect suspicious activity and prevent attacks. Examples of attacks include account takeovers, new account fraud (opening a new account using stolen credit card information), and terms of service abuse (e.g. abusing promotional codes, or manipulating cookies for advantage).
The usual problem with such outlier/anomaly detection systems is that they trigger lots of false positive alarms that take substantial time and effort to investigate. After a system has ‘cried wolf’ enough times, it becomes distrusted and of limited use. AI2 combines the experience and intuition of analysts with machine learning techniques. An ensemble of unsupervised learning models generates a set of k events to be analysed per day (where the daily budget k of events that can be analysed is a configurable parameter). The human judgements on these k events are used to train a supervised model, the results of which are combined with the unsupervised ensemble results to refine the k events to be presented to the analyst on the next day.
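The daily budget k can be illustrated with a small sketch. The post does not say here how the ensemble merges its scores, so the rank-averaging below is an assumption made for illustration only, and the names `rank_normalize` and `select_top_k` are hypothetical.

```python
def rank_normalize(scores):
    """Map raw scores to (0, 1] by rank, so detectors on different scales are comparable."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position / len(scores)
    return ranks


def select_top_k(detector_scores, k):
    """Average rank-normalized scores across detectors (an assumed combination rule)
    and return the indices of the k highest-scoring events."""
    normalized = [rank_normalize(scores) for scores in detector_scores]
    n_events = len(normalized[0])
    combined = [
        sum(detector[i] for detector in normalized) / len(normalized)
        for i in range(n_events)
    ]
    ranked = sorted(range(n_events), key=lambda i: combined[i], reverse=True)
    return ranked[:k]


# Made-up outlier scores for five events from two detectors with different scales.
scores_a = [0.1, 0.2, 9.0, 0.3, 0.15]
scores_b = [10, 11, 95, 12, 80]

# Indices of the events the analyst would see with a budget of k = 2.
top = select_top_k([scores_a, scores_b], k=2)
print(top)
```

Ranking before averaging is one simple way to keep a detector with large raw values from dominating the shortlist; other combination rules would work equally well in this sketch.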
One of the important points of the post is its emphasis on advanced Machine Learning techniques, such as Principal Component Analysis and Replicator Neural Networks, to reduce the rate of false positives when flagging IT attack events:
AI2 tracks activity based on ingested log records and aggregates activities over intervals of time (for example, counters, indicators of whether something happened in the window at all, elapsed time between events, number of unique values, and so on). These features are passed into an ensemble of three unsupervised outlier detection models:
- A Principal Component Analysis (PCA) based model. The basic idea is to use PCA to determine the most significant features (those that explain most of the variance in the data). Given an input, take its PCA projection, and then from the projection, reconstruct the original variables. The reconstruction error will be small for the majority of examples, but will remain high for outliers.
- A Replicator Neural Network (not to be confused with a Recurrent Neural Network; both get abbreviated to RNN). This works on a very similar principle. The input and output layers have the same number of nodes, and intermediate layers have fewer nodes. The goal is to train the network to recreate the input at the output layer, which means it must learn an efficient compressed representation in the lower-dimensional hidden layers. Once the RNN has been trained, the reconstruction error can be used as the outlier score.
- The third unsupervised model uses copula functions to build a joint probability function that can be used to detect rare events.
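The reconstruction-error idea behind the PCA model (and, by analogy, the replicator network) can be sketched in a few lines. This is a minimal illustration, not the system's implementation: it assumes a single retained principal component, finds it by power iteration, and runs on made-up two-dimensional data. All function names are hypothetical.

```python
import math


def mean_center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Power iteration: converges to the dominant eigenvector of the covariance matrix."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def reconstruction_errors(rows):
    """Project each row onto the first component, reconstruct it,
    and return the squared distance between row and reconstruction."""
    centered = mean_center(rows)
    v = first_component(covariance(centered))
    errors = []
    for row in centered:
        t = sum(a * b for a, b in zip(row, v))  # coordinate along the component
        reconstructed = [t * b for b in v]
        errors.append(sum((a - b) ** 2 for a, b in zip(row, reconstructed)))
    return errors


# Made-up data: six points close to the line y = 2x, plus one point far off it.
rows = [[1, 2.1], [2, 3.9], [3, 6.2], [4, 7.8], [5, 10.1], [6, 11.9], [3, 9.0]]
errors = reconstruction_errors(rows)
most_anomalous = errors.index(max(errors))
print(most_anomalous)
```

Points that follow the dominant pattern are reconstructed almost perfectly from their projection, while the point that breaks the pattern keeps a large error, which is what makes the error usable as an outlier score.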
The whole AI2 system cycles through training, deployment, and feedback collection/model updating phases on a daily basis. The system trains unsupervised and supervised models based on all the available data, applies those models to the incoming data, identifies k entities as extreme events or attacks, and brings these to the analyst’s attention. The analysts’ deductions are used to build a new predictive model for the next day.
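The daily cycle can be sketched as a toy loop. Everything concrete here is an assumption for illustration: events are reduced to a single number, the outlier score is the distance from a typical value, the "supervised model" is a simple comparison against the means of labelled examples, the two scores are simply added, and the analyst is a stand-in function. None of these reflect the actual models used in AI2.

```python
from statistics import mean


def train_supervised(labeled):
    """Toy supervised model (assumption): score an event by how much closer it is
    to the mean of confirmed attacks than to the mean of benign examples."""
    attacks = [x for x, is_attack in labeled if is_attack]
    benign = [x for x, is_attack in labeled if not is_attack]
    if not attacks or not benign:
        return None  # not enough analyst feedback yet
    attack_mean, benign_mean = mean(attacks), mean(benign)
    return lambda x: abs(x - benign_mean) - abs(x - attack_mean)


def run_day(events, outlier_score, supervised, analyst, labeled, k):
    """One cycle: rank events, show the top k to the analyst, retrain on all labels."""
    def combined(x):
        score = outlier_score(x)
        if supervised is not None:
            score += supervised(x)  # assumed combination: plain sum
        return score

    shortlist = sorted(events, key=combined, reverse=True)[:k]
    labeled.extend((x, analyst(x)) for x in shortlist)
    return shortlist, train_supervised(labeled)


def outlier_score(x):
    return abs(x - 5)  # hypothetical: 5 is the typical value


def analyst(x):
    return x >= 9  # hypothetical analyst judgement standing in for a human


labeled = []
day1, model = run_day([4, 6, 5, 60, 0, 7], outlier_score, None, analyst, labeled, k=2)
day2, model = run_day([5, 0, 9, 4, 6], outlier_score, model, analyst, labeled, k=1)
print(day1, day2)
```

On the first day the unsupervised score alone surfaces one real attack (60) and one harmless outlier (0). On the second day, the labels from day one steer the shortlist toward the event 9 rather than the harmless 0 that the outlier score alone would have ranked first.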
Photo and further references: The Morning Paper AI^2: Training a big data machine to defend