Recent advances in artificial intelligence, especially in deep learning and other machine learning approaches, have generated real excitement about the future of security. In the rush to roll out AI in security technology, it is easy to forget that machine learning is just a tool, and that, like any tool, it is most effective in the hands of an expert.
“We don’t have artificial intelligence, yet,” Raffael Marty, a security analytics expert, said at the recent Security Analyst Summit. “In no field do we have AI.”
We don’t have autonomous systems capable of the kind of learning and decision-making that we typically associate with AI, Marty said. Machine learning can beat humans at Go, help scientists design more effective drugs, and make virtual assistants such as Siri smarter, but none of that amounts to AI. A common mistake is to conflate machine learning, deep learning, and data mining, or to just lump them all into the AI bucket. Machine learning refers to algorithms that learn from data to describe it or make predictions; deep learning is a newer class of machine learning algorithms built on multi-layer neural networks; and data mining lets analysts explore the data and find patterns themselves.
“Just calling something AI doesn’t make it AI,” Marty said.
There are two main ways to use machine learning to identify malicious behavior or entities: supervised and unsupervised. Supervised machine learning is great at classifying data as good or bad, as in malware identification and spam detection. Unsupervised machine learning is better suited to making large data sets easier to analyze and understand, such as analyzing DNS traffic, curating threat intelligence feeds, and handling lower-severity security events so that threat analysts can focus on high-severity incidents.
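As a minimal sketch of what the supervised case looks like, here is a toy spam classifier. The scikit-learn library, the bag-of-words features, and the four-message training set are illustrative assumptions, not anything from Marty’s talk; a real spam filter is trained on millions of labeled messages.

```python
# Toy supervised learning: classify messages as spam (1) or ham (0).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "Win a free prize now",           # spam
    "Claim your reward, click here",  # spam
    "Meeting moved to 3pm",           # ham
    "Lunch tomorrow?",                # ham
]
labels = [1, 1, 0, 0]

# Turn each message into a bag-of-words feature vector.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Learn from the labeled examples, then classify an unseen message.
clf = MultinomialNB().fit(X, labels)
test = vectorizer.transform(["Click here to win a free prize"])
print(clf.predict(test))  # [1] -> flagged as spam
```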
It’s increasingly easy to build “AI” into security products because machine learning algorithms are readily available. Think TensorFlow, an open source machine learning framework, or Torch, an open source machine learning library. Amazon Machine Learning is a managed service for building machine learning models on Amazon Web Services. But without experts who understand the data and know how to use the algorithms, we wind up with lots of results and very little insight.
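To see how low that barrier really is, here is a sketch using TensorFlow’s Keras API. The feature count and the labels below are made up; the point is that a few lines will stand up a deep learning model that trains happily on pure noise, producing “results” without any insight.

```python
# A deep learning model in a handful of lines -- the easy part.
# The 20 features and binary labels here are random noise,
# standing in for real, labeled security telemetry.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)
model.fit(X, y, epochs=3, verbose=0)  # trains without complaint on noise
```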
There are many examples of machine learning where the algorithm learned something, but not quite what it was supposed to learn. Beauty.AI was supposed to be the first international beauty contest “judged by machines,” but its algorithms turned out to be biased against contestants with dark skin. A study by Carnegie Mellon University researchers found that online ad algorithms were gender-biased: significantly fewer women than men were shown online ads for jobs paying more than $200,000.
“Algorithms can be dangerous if you just download the library,” Marty warned.
The bias didn’t come from the algorithm itself, but from the data. Doing supervised machine learning well requires a large collection of training data. Algorithms tend to make assumptions about the data and how it is distributed: that the input source is providing clean data, and that the samples are representative. They don’t do well with outliers. If the training set isn’t big enough, there’s no guarantee the algorithm learned correctly.
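A quick synthetic illustration of how skewed data misleads: if 99 percent of the training samples are benign, a model that always answers “benign” scores 99 percent accuracy while catching zero attacks. The data here is fabricated for the example.

```python
# Synthetic data: 1,000 samples, only 1% labeled malicious.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = np.zeros(1000, dtype=int)
y[:10] = 1  # 10 malicious samples out of 1,000

# A "classifier" that always predicts the majority class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))  # 0.99 -- looks impressive
print(recall_score(y, pred))    # 0.0  -- misses every attack
```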
Machine learning is effective at finding spam and malware because there are millions of good and bad samples to train the algorithms on. That dependence on training data is also why Marty advises against using machine learning on network traffic to detect attacks: there are no good training data sets for these problems, and without good training data, there is no way to train the algorithms.
The process is just as important as the algorithm, if not more so. Don’t use machine learning if there isn’t enough labeled data, or if there aren’t well-trained domain experts and data scientists to engineer good features. Cleaning the data set, training the algorithm, and making sure the features being used are appropriate are all important.
You need to be able to engineer good features and understand what was actually learned. “This isn’t just something you roll out of the box,” Marty said.
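As a hedged illustration of what “engineering good features” can mean, here is how a domain expert might turn a raw domain name into numeric features for, say, DNS traffic analysis. The specific features (length, digit ratio, character entropy) are common illustrative choices, not Marty’s prescription.

```python
# Hand-engineered features for a domain name (illustrative only).
import math
from collections import Counter

def domain_features(domain: str) -> dict:
    name = domain.split(".")[0]  # drop the TLD
    counts = Counter(name)
    total = len(name)
    # Shannon entropy: random-looking names score higher.
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return {
        "length": total,
        "digit_ratio": sum(ch.isdigit() for ch in name) / total,
        "entropy": entropy,
    }

print(domain_features("google.com"))
print(domain_features("x7f3kq9z2b1.com"))  # random-looking -> high entropy
```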
While unsupervised machine learning techniques are great for data exploration, they require careful attention. Clustering and association rules can help group related pieces of information together, but they are limited when it comes to detecting anomalies. Some types of information don’t work well with distance functions: port numbers, IP addresses, and process IDs look like numerical features, but they are really categorical labels, so the numeric “distance” between two values carries no meaning. And an anomaly the algorithm surfaces may be the result of a misconfigured system, not an attack. Analysts need context and domain knowledge to understand the algorithm’s results.
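To make the port number problem concrete: numerically, port 80 is far “closer” to port 443 than to port 8080, so a distance-based clustering algorithm would treat web traffic on 80 and 8080 as dissimilar. One-hot encoding the ports as categories removes that false ordering; the encoding choice here is an illustration, not the only fix.

```python
# Ports look numeric, but numeric distance between them is meaningless.
import numpy as np

ports = np.array([[80], [443], [8080]])

# Naive numeric view: |80 - 443| = 363 vs |80 - 8080| = 8000,
# so a clustering algorithm would call 80 and 443 "similar".
print(abs(ports[0, 0] - ports[1, 0]), abs(ports[0, 0] - ports[2, 0]))

# Categorical view: one-hot encode, so every distinct port is
# equally far from every other -- closer to the truth here.
categories = np.unique(ports)
one_hot = (ports == categories).astype(int)
print(one_hot)
```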
“Stop throwing algorithms on the wall [and see what sticks]. They are not spaghetti,” Marty said.
Marty offered some advice for practitioners trying to solve security challenges with machine learning. Work with domain experts to gather the right data and identify the right approach, he said. Use machine learning only if there is a large volume of well-labeled data. Use visualization tools or other methods to verify the models, and build feedback loops to collect information from users. Experts should be able to supervise the algorithm.
“Start with the problem at hand and choose the right approach. It’s hardly ever machine learning,” Marty said. “Don’t start with algorithms.”