(upbeat music) You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere.

Welcome to another installment of Data Skeptic: Graphs and Networks. Today we're taking on a topic I've been eager to get to. Hopefully there'll be even more on this, but today it's all about cybersecurity and how networks can be a tool for finding fraud, crime, and other malicious activity. Asaf, is this an area you've worked in before?

- Yes, and I have some bad news and good news about it. People often turn to me and say, well, we can use network science and networks to do some anomaly detection: look for suspicious actions in their organization's IT network and find out if something malicious is going on. But if I can plug the show's name, I'm a bit skeptical. Usually when people turn to cyber defense solutions, they want to get the alerts. The first demand a customer, a client, has about cybersecurity alerts is: don't give me all the alerts, give them to me in order, give me a ranked list of alerts. I guess the second demand is how to turn off the alerts, because you get a long list, and while you dive into the first one and try to figure out what's going on there, the alerts keep accumulating. The good news is that if you transform this list into a network, let's say an IP network, which IP connects to which IP, and you look at the problems or the alerts this way, the network can easily focus you on the main problems, the ones with, say, the highest degree, and you can see which problems are a subset of other problems, because they form a subgraph or a community in the graph. This is my take on cybersecurity.

- Yeah. He uses this nice technique, HMIL, hierarchical multi-instance learning, where instead of having a somewhat traditional dataset where everything is labeled, you have a bag of instances and those share a label. That technique, I guess, is how they achieve scalability; their graph neural networks weren't scaling well for this particular problem set, which I found pretty interesting. To the best of your knowledge, is there a lot of graph and network effort going on in cybersecurity? It seems it'd be a focus in that industry.

- I advocate for network science everywhere, in fraud detection also, and that's why I started NETfrix, the network science podcast, to show the applications that networks can have in every industry, actually.

- Absolutely. Šimon has a unique opportunity here. He's in partnership with Cisco, so they have a bunch of low-level network routing data that I'm sure is very insightful. I wish I could have gotten my hands on that. I'm a little jealous of the project he got to work on here. Well, let's jump right into the interview.

- Let's.

- I'm Šimon Mandlík from the Czech Technical University. I also work at a research lab at Gen, formerly Avast, and I'm pursuing my PhD at the university at the same time. In my PhD, I'm focusing mainly on JSON data. What we would like to do is to have the same kind of off-the-shelf algorithms that we have for images, for sequences, for text in general. We would like to have this also for JSON data. This type of data is quite widespread, I would say, but it's still quite ignored by the machine learning community, by the research community.
So, at least during my PhD, I want to fix that.

- Well, one of the distinguishing properties of JSON data to me is that it's inherently unstructured. There's no typing, there's no heavy schema to it. That could be its strength or its weakness. How do you see it?

- One constraint, or assumption, that we work with is actually that this kind of data has a schema. We define it in our papers, and it's one of the first things; this schema is important. It's quite loosely defined, in the sense that there can be missing data, you can have several data types at the same path, things like that. But still, you have to have something like that, otherwise it's very, very difficult. Recently, with the rise of large language models, you could attempt to ignore the schema altogether, push everything through the LLM, and see what it tells you. But that also has some problems. One of them is the context length; another is the computational complexity of transformers. So it's not that straightforward. JSON is a hierarchical data structure, which consists of atomic values that compose higher-level structures, objects that represent objects in the real world. The schema, as we define it, I would describe as a hierarchical data type. Let's say you have a JSON that describes people; the document can be a list of people, for example. Each person may have an age, which will be an integer, and a name, which is a string. This is how we define the schema. And once you have it, you can build a model that is able to process these kinds of structures. Even though we want the schema to be fixed for one particular dataset, the framework that we propose in our papers can handle basically any schema. Given one particular dataset, you need that dataset to have some fixed schema, but when you switch to another dataset, you can have a different schema.

- That's interesting. And then is it that the techniques are universal, or can the techniques interchange between schemas?

- The techniques are universal, so you can reuse a lot of things between different datasets. Obviously, some datasets are more difficult than others. Some are very easy, so you can just use the defaults and get really, really good accuracy.

- So what are some of the domains you can apply the techniques in?

- At the beginning I said that this is quite specific and, you could say, a niche kind of data. We see it a lot in cybersecurity. One interesting application is when you want to learn something about an executable: you can do static analysis, or you can run it in a sandbox and get some information, which is what we call dynamic analysis. What is interesting is that the tools in this space that provide this, static analysis tools and sandboxes like Cuckoo for dynamic analysis, output JSON for each executable. This is one big source of this kind of data. What the CyberSec community usually does is take this JSON and then think very hard about what to do next with it. They want to apply some classifier, say, to tell the difference between malware and benign files.
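To make that concrete, here is a toy sketch in Julia of the kind of nested report a sandbox might emit and the loose schema behind it. The field names and values are invented for illustration; they are not the actual output format of Cuckoo or any other tool.

```julia
# A toy, made-up example of the kind of JSON a sandbox might emit for one
# executable (field names are hypothetical, not a real tool's output):
report = Dict(
    "static"  => Dict("size" => 73_728, "imports" => ["CreateFileW", "WriteFile"]),
    "dynamic" => Dict(
        "opened_files" => ["C:\\Users\\victim\\notes.txt", "C:\\Windows\\Temp\\x.tmp"],
        "syscalls"     => [Dict("name" => "NtWriteFile", "count" => 12),
                           Dict("name" => "NtCreateProcess", "count" => 1)],
    ),
)

# The loosely defined schema described in the interview is essentially the shape
# of this structure: which keys exist, which values are lists, and what atomic
# types sit at the leaves. Sketched here as nested type annotations:
schema_sketch = Dict(
    "static"  => Dict("size" => Int, "imports" => Vector{String}),
    "dynamic" => Dict(
        "opened_files" => Vector{String},
        "syscalls"     => Vector{Dict},   # list of objects: name::String, count::Int
    ),
)
```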
But if they want to apply some off-the-shelf standard classifiers, like neural nets or decision trees or random forests, they need to have a feature vector: a fixed-length array of numbers, say, or of categorical variables, but only one dimension. This is a really hard constraint on these models. What researchers in this area usually do is define this mapping from JSONs to features. These can be really high-level, really powerful features, but the problem is that this needs a lot of knowledge, a lot of expertise in the domain. The other problem is that these features can become obsolete very quickly, especially in CyberSec, which is a very dynamic landscape, a very dynamic domain. What you can do instead is take our framework, which is called JsonGrinder.jl, written in the Julia language, and apply it to this data pretty much out of the box. It's like 20 lines of code, and you get a pretty good baseline without any of what I've just described. And it's not only these sandbox outputs in CyberSec; the paper that we are discussing today is another example of a nice application of our HMIL framework.

- So let me see if I've got the cybersecurity example right. You've got a suspected piece of malware, and there are maybe these common sets of diagnostic tools that are run in a sandbox to see if it does something suspicious. But what I'm picturing coming out of that is probably pretty raw data, logs and stuff. I don't imagine the machine learning is at that layer, that there are these one-size-fits-all tools. What does the raw data look like, and how do you turn it into a feature vector?

In today's digital age, the sheer volume of personal information scattered across the internet can be daunting. I've been there. That's why I was intrigued by Delete Me's approach to digital privacy. What sets them apart is their flexible, user-centric process that puts you in control. One of Delete Me's standout features is their customizable privacy protection. When you sign up, you decide exactly how much information you want to protect. Start small if you prefer, then expand your protection as you witness their effectiveness firsthand through detailed removal reports. Their service goes beyond one-time removals. Delete Me actively monitors and eliminates any new or recurring data throughout your subscription period. Their team of privacy experts handles the complex removal process with hundreds of data brokers, making digital privacy protection effortless for you. Experience peace of mind knowing your online privacy is in capable hands. Visit Delete Me today and take control of your digital footprint. Keep your private life private by signing up for Delete Me, now with a special discount for our listeners. Today, get 20% off your Delete Me plan by texting "data" to 6400. That's data, D-A-T-A, to 6400. Message and data rates may apply. (upbeat music)

- What does the raw data look like, and how do you turn it into a feature vector?

- The point is that you don't. That's the main selling point of HMIL, which stands for, by the way, hierarchical multi-instance learning. These logs are very rich. For example, when you run some binaries through Cuckoo, you can get files that are 10 megabytes of data, so it's really large. With the current approaches, you would have to transform it into a feature vector.
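For contrast, here is a minimal sketch of that classical feature-engineering step: an expert hand-picks a few summary numbers from a report like the toy one above. The specific fields and features are hypothetical, chosen only to show what "transform it into a feature vector" means in practice.

```julia
# A minimal sketch of hand-crafted feature engineering on a (made-up) sandbox
# report. The fields and the choice of features are invented for illustration.
report = Dict(
    "static"  => Dict("size" => 73_728),
    "dynamic" => Dict(
        "opened_files" => ["C:\\Users\\victim\\notes.txt", "C:\\Windows\\Temp\\x.tmp"],
        "syscalls"     => [Dict("name" => "NtWriteFile", "count" => 12),
                           Dict("name" => "NtCreateProcess", "count" => 1)],
    ),
)

suspicious_calls = Set(["NtCreateProcess", "NtMapViewOfSection"])

function handcrafted_features(report)
    dyn = report["dynamic"]
    n_files      = length(dyn["opened_files"])
    n_temp_files = count(f -> occursin("Temp", f), dyn["opened_files"])
    n_suspicious = sum(c["count"] for c in dyn["syscalls"] if c["name"] in suspicious_calls; init = 0)
    file_size    = report["static"]["size"]
    # One fixed-length numeric vector per sample, ready for a random forest or MLP.
    return Float64[n_files, n_temp_files, n_suspicious, file_size]
end

handcrafted_features(report)   # [2.0, 1.0, 1.0, 73728.0]
```

Every one of these features encodes expert knowledge, and each can go stale as attacker behavior changes, which is exactly the maintenance cost described above.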
But what we do is something a little bit different. We just read the file and represent it. We take a dataset, read, say, a hundred samples, and infer the schema, so that we know what these JSONs contain. When we're talking about the Cuckoo sandbox, we get, for example, which files were opened during execution, where the file writes, what syscalls the executable makes. Everything like that we can capture in our schema, and then we load each document into an in-memory hierarchical structure. What follows is a special hierarchical model, which is based on multi-instance learning. There are some projections with a couple of neural net layers, there are some aggregations, and we basically have some sub-models; it's all very hierarchical and recursive. So we load the document into memory, process it as such, and we get one vector as an output, which you can consider as an embedding, and you can slap another classifier after that and train it. That's another way we put it in our papers: HMIL is basically a learnable embedding for JSON files or other similarly hierarchically structured data.

- So you've ingested it, you've inferred the schema, you have essentially structured data at this point. How do we get to an embedding? Is it through traditional techniques, or is there something new about it?

- This is the multiple instance learning part. Multiple instance learning is quite an old idea already. In the very beginning, multiple instance learning was an extension of standard machine learning: in standard machine learning you have this one fixed-size vector, and in multiple instance learning you get a set of these vectors, and you get one label for this set of vectors. The set of vectors is called a bag, and you want to predict one label per bag. So instead of a dataset that consists of a collection of vectors, you get a collection of sets of vectors, and what multiple instance learning does is research models for this type of data. But as I said, this is quite an old idea, and it's since been extended to hierarchical multi-instance learning. You can have not only a bag of vectors, you can have a bag of bags of vectors; you can recursively compose these concepts together. Essentially, that gets us to our JSONs, because these are nothing else, right?

- Is it the case that the bag is known to all share a label, or is it more like voting, where you're picking the most appropriate label given the population in the bag?

- It depends on the dataset. I should mention that the elements of the bag are called instances, because in hierarchical multi-instance learning they don't have to be vectors; they can be something more complex. In early works, it was assumed that each instance also has a label, but you do not observe these labels; you only observe the label of the whole bag. Nowadays, it doesn't really matter, because we have neural nets, which can approximate any function, so it doesn't matter whether that assumption holds or not. For example, when you have a list of people, each person can be classified into some category, and then those can work together to give the label.
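As a purely conceptual sketch of that idea, with a trivial scoring function standing in for the learned neural-net layers: one bag gets one aggregated score and one label, and the same pattern nests for bags of bags.

```julia
# Conceptual sketch of (hierarchical) multi-instance learning. The scoring
# function below is a trivial stand-in for learned neural-net layers.
instance_score(x) = sum(x)                      # score one instance (e.g. one person)

# One bag -> one aggregated score -> one label for the whole bag.
bag_score(bag) = maximum(instance_score.(bag))  # max aggregation; mean also works
bag_label(bag) = bag_score(bag) > 1.0 ? "positive" : "negative"

people = [[0.2, 0.1], [0.9, 0.4]]               # a bag with two instances
println(bag_label(people))                      # "positive"

# Hierarchical MIL nests the same pattern: a bag of bags aggregates bag scores.
bag_of_bags_score(bob) = maximum(bag_score.(bob))
println(bag_of_bags_score([people, [[0.1, 0.0]]]))   # ≈ 1.3
```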
- Yeah, as if there could be some underlying function that provides the label, which we want to approximate.

- Yeah, it could be part of it, very possibly. The same could hold for the sandboxes that I talked about. You could classify syscalls, these calls to core OS functionality, each of them with some severity, whether it's very suspicious or not, and then compose the global label out of these sub-labels, let's say. It's very likely something like that is happening under the hood when we are training the models. Another part of our models, and of what multi-instance learning does in general, is that it needs to somehow aggregate this information. We need to design aggregation functions; more often than not, just mean or max is enough. And this is basically what you described: getting one global label out of several lesser labels.

- Well, I've been led to believe that in cybersecurity, one of the most useful types of features is a frequency counter. How many times has this person failed their password? In the last 90 days it's zero, but they've failed 100 times today; that's pretty suspicious. So I know features like that are very present in cybersecurity, where they're just counting how often something happens, or whether they've seen you from this location before. Does your vector sit next to data like that in a deployment, or do you kind of subsume it and automatically figure out frequency-type features?

- The model can definitely learn to count something if you have an array of things in your JSON, in your document. It is very, very easy for the model to learn to count these things. Practically speaking, when there are a lot of these things and you just need the counter, just the frequency, it can be quite wasteful, honestly. If you observe thousands of events per day and you need to just remember their number, it's really overkill to train a machine learning model to do that. In all of machine learning, and in CyberSec especially, it's all about trade-offs. You have some domain knowledge, you know which things work and which don't, so you put some of this inductive bias into your models. This is what feature engineering does. There's a wide spectrum of what you can do, from really taking the time and effort to design powerful high-level features, which is very expensive, to just collecting the data and letting machine learning do its work. We always need to decide where we want to stand. Basically, what we do with the HMIL framework is provide another point on that spectrum. It's less manual and labor-intensive than designing the features, but you still get some of the machine learning magic that you really like.

- So through this process, I believe the output is these vectors, where you have a really nice embedding that describes the data considering its hierarchy and its structure and all these sorts of things. What's your next stage? Or maybe we should jump ahead and ask: what's the objective? What are we trying to infer from the data?

- Yeah, it depends. As I said, you can regard any HMIL model as a learnable embedding, and you can do a lot of things with the embedding. You can glue a couple of neural net layers at the end and learn some classifier. You can train not only classification but also regression. You can consider these embeddings as some latent space, depending on which loss function you select.
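For readers who want to see roughly what that pipeline looks like in code, here is a rough outline using JsonGrinder.jl and Mill.jl. The function names (schema, suggestextractor, reflectinmodel) follow the libraries' documented examples as I recall them, but exact signatures and return types have shifted between versions, so treat this as a sketch rather than a definitive recipe; the input documents are invented.

```julia
# Rough outline of the JsonGrinder.jl / Mill.jl workflow described above.
# Exact function signatures vary between library versions; treat as a sketch.
using JSON3, JsonGrinder, Mill, Flux

raw = [
    """{"name": "alice", "age": 34, "visits": ["a.example", "b.example"]}""",
    """{"name": "bob",   "age": 27, "visits": ["c.example"]}""",
]
samples = JSON3.read.(raw)

sch       = schema(samples)           # infer the loose schema from the data
extractor = suggestextractor(sch)     # maps a JSON document to a Mill data node
ds        = extractor.(samples)

# reflectinmodel builds the hierarchical HMIL model (projections + aggregations)
# mirroring the schema; the closure sets the width of each learned layer.
encoder = reflectinmodel(ds[1], d -> Dense(d, 16, relu))

embedding = encoder(ds[1])            # one fixed-size vector per document
# A classifier or regression head (e.g. Flux's Dense(16, 2)) can then be
# trained on top of these embeddings with whatever loss fits the task.
```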
You can also train a very nice latent space model. It's quite similar, I would say, to what standard machine learning does for images: we have latent spaces that you train, you can classify images, you can generate images. We can do all of that. Basically, the HMIL framework is just the step from a document to this vector.

- So how do graphs play a role in the process?

- One obvious thing is that you can consider each JSON document, and not only JSON but also XML and other such documents, as a tree, basically, from a graph point of view. But the paper that we're talking about is a very nice application of HMIL to graph data. We got some data from Cisco, and these data were graphs, basically. What you can usually do with graphs is apply machine learning as well; there is a rich array of methods for that, and among the most prominent are graph neural nets and models like that. We tried a slightly different approach, and we had several reasons for that. The first one is that the data was large, exceptionally large; we couldn't even dream of applying any of these graph neural nets, which are really powerful but better suited for smaller data. If I should describe this kind of data, it's basically just very, very simple observations, and you can see them as binary relations. You get a set of clients, computers in a network, and you get a set of domains, let's say second-level domains; these are the domains that the clients connect to. You can collect data at the edge of the local network and just write down which client connected to which domain, so you get these pairs. And not only clients; you can also observe binaries, executables. You can write down some hash of the executable and remember that. So at the end of the day, you are left with a really, really long list of these connections, of these observations. It's just binary relations, just pairs. From this, you can model higher-level patterns of behavior of these objects in the network. You can observe that two clients connect to the same domain, et cetera; you can already derive rules like that. And this leads to a representation as a graph. You can transform this into a graph, and this graph is still very, very huge. Classical graph neural nets can't really be applied here. So what we did is apply HMIL here, the HMIL architecture. The difference from graph neural nets is that graph neural nets apply message passing over the whole graph, whereas HMIL just views the graph as a kind of database. What we describe in the paper is that, let's say, we are interested in one vertex in the graph, which represents one client or, better, one domain. This domain can be malicious or benign, and we want to decide that; we want to construct a classifier that would predict it.

- With regard to graph neural networks, is it correct then to say that the scalability of message passing is one of the main limiting factors, where that technique can't tackle the problem?

- Yes, yes. What GNNs do is apply several steps of message passing, and that is really, really expensive for this type of data. We had domains represented as vertices in these graphs that had hundreds of thousands of neighbors.
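To picture that "graph as a database" view, here is a minimal sketch with made-up telemetry: the raw data is just a long list of (client, domain) pairs, and classifying one domain means collecting the bag of things connected to it, rather than propagating messages across the whole graph.

```julia
# A made-up sketch of the telemetry described above: a long list of
# (client, domain) pairs, i.e. one binary relation of a bipartite graph.
edges = [("client1", "evil.example"), ("client2", "evil.example"),
         ("client1", "news.example"), ("client3", "news.example")]

# Index the relation once, like a database table keyed by domain.
neighbors = Dict{String, Vector{String}}()
for (client, domain) in edges
    push!(get!(neighbors, domain, String[]), client)
end

# The "bag" for one vertex of interest: an HMIL model embeds and aggregates
# this bag (and further relations such as binaries or TLS certificates in the
# paper) instead of running message passing over the entire graph.
bag_for_domain = neighbors["evil.example"]     # ["client1", "client2"]
```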
With neighborhoods like that, you can't compute it with GNNs, so we took a different approach here.

- So you have this very interesting topology as a graph, but it's not obvious what's malicious and what's not. How do you get labels here?

- Yes, that was a different challenge that we also had to solve when writing the paper, and this is a challenge in all of network security: you usually get some kind of denylist, which is curated by your analysts, and you want to extrapolate from this denylist. So what we know when training is that we have this graph and we know that some of these vertices are malicious domains, and we want to take these as labels. In machine learning, this is called the positive-unlabeled problem, because we know some vertices in the graph are malicious, are positive, but we know nothing about the rest. The rest is highly likely benign, because if you take a random vertex from the graph, a random second-level domain from the internet, it's likely benign because of the prior, but we don't know that for sure. That's one problem. Another problem is that you really have to be careful when evaluating your algorithm. Let's say we have a domain abc.com and another domain, def.com, and you know the first one is malicious and the other one is not. You train your model on a week of data, for example, and then you want to test your model on the next week. So you collect the data and predict the label for each domain represented as a vertex in this graph. But then you need to be really careful if these two domains from your denylist appear in the next week of your data, because you want the model to learn the specifics of malicious behavior and not the specifics of this particular domain. If there is something really specific about the abc.com domain, your model can learn that, but it's not really useful, because these malicious domains pop up and disappear very quickly. You really need to learn the behavior. We also employed a technique to test that the model is immune to this problem.

- And then the model's output, what does it give you as it's labeling the new vertices? Is it just a binary classifier? Is there a probabilistic score?

- Yes, it was a binary classifier with some confidence score.

- And then I'm picturing the graph where anyone who's on the denylist is known to be a malicious node, and then somewhere between zero and 100% of the other nodes are going to get labeled malicious. What is the output? Could you give us some sort of summary analysis on the frequency with which you discover it, or something along those lines? How much malicious data is there to be discovered?

- It's usually less than a percent of all the incoming domains. One possible application of this model, of this technique, is that you can use it to generate possible malicious domain candidates. You can run your model, take the top, say, 500 domains daily, and send them to your analysts. You can largely reduce the chunk of work they have to do; they can just focus on these very suspicious domains.

- Well, in the general machine learning case, it's very nice when you can look at your holdout dataset and say you have type one errors and type two errors and an F1 score and these sorts of things. But you have a somewhat unique challenge in that there isn't a perfect ground truth. How do you look at performance metrics?
- There are some specifics that apply to the whole CyberSec field. One of these is that ground truth is expensive. I wouldn't say that it's not of high quality; there is ground truth available, but it's sparse and scarce. For example, when classifying images, everybody can tell whether an image is a dog or a cat. But to classify a domain, you need professionals, experts with their tools. It's sometimes very hard to tell, especially if you're not a core CyberSec analyst and you're doing machine learning; sometimes you need to cooperate with others. This is something that makes this domain more challenging than, I would say, images, because when you're developing a classifier for images, you can always check whether it makes sense; you know what the image shows. I wouldn't say that in CyberSec we use a different set of metrics; it's pretty much the standard set. In this case, where the prior of a malicious domain is low, so there are many, many more benign domains than not, you can look at it as a kind of retrieval problem, so precision and recall are perfectly valid metrics here. What I would say is that in CyberSec, false positives are really expensive mistakes to make. What we usually do is plot ROC curves and look at the areas with low false positive rates. So we usually plot the ROC curve with a logarithmically scaled x-axis, and you are interested in that region. But these are also quite standard metrics, I would say.

- Do you think it has a path to industry? Is this something that will improve the quality of cybersecurity?

- Yes, definitely. As I said, this was developed with Cisco, so it was tested on real-world production data. I think the potential here is huge.

- Could you outline, maybe for listeners' sake, some of the wins that'll come there? I would imagine there are accuracy gains from new techniques, but I think there are maybe even some benefits in being able to quickly create a new embedding rather than hand-coding features. What do you see as the major wins for getting this into the cybersecurity world?

- There is a stark improvement with respect to the previous state-of-the-art algorithm used for this kind of problem, which is called probabilistic threat propagation. Even though it works quite well for some problems, it has some serious limitations. For example, what probabilistic threat propagation can't do is employ multiple binary relations. I spoke about clients and domains, then about binaries and domains; you can observe multiple such binary relations and work with them. Probabilistic threat propagation can only process one of these binary relations at a time. We employed more of these relations; we took, if I remember correctly, 11 of them. These were clients, binaries, and we also had some TLS certificates. We put all of this into the framework, and it dramatically increased the efficacy of the model, threefold, I think. So this was a very, very good improvement in efficacy, I would say. And it was also really nice that we were able to apply neural nets and these modern machine learning techniques to this kind of problem.

- The hierarchical multi-instance learning, then, has had good success in this domain.
Are there any other areas you look forward to maybe applying it to in the future, or perhaps other projects where you currently are?

- Our plans currently, and the plans of our whole group, are to aim for cybersecurity. Currently, for example, we are preparing a paper for USENIX, which is a cybersecurity conference rather than a core machine learning conference. CyberSec is a really fitting domain here. We would like to get feedback on our work, obviously, and also to show our work to the world.

- With regard to JsonGrinder.jl that we talked about earlier, could you share some details on it as a tool? Is it becoming a project that other people could use? Maybe something open source, or what's the state of it?

- We have two main libraries, which are completely open source. They are written in Julia, a language which is very similar to Python, so anybody can try them. One is JsonGrinder.jl, which, as the name suggests, can process these JSONs: it can infer the schema and prepare the JSONs into the structures that I described, so that the model can process them. Then we have Mill.jl, the multiple instance learning library, which takes care of the modeling; you can build a model there and process these data. These are the two core libraries that we have. We are preparing a third one, which is about explainability. Nowadays it's really important to be able to explain the results of your model, and we have some tools in machine learning, but they have some big problems, for example, that they do not really faithfully describe what the model actually computes. This is still an open problem in machine learning, I would say, and not fully solved at all. It turns out that the way we process the data, essentially skipping the vectorization, the feature-vector mapping, is a really great advantage for later explainability, because you have the whole input at your disposal and you can create an explanation out of it. So this is the third library; it is open source, but it's still in development. It's called ExplainMill.jl, and it will be for explanations. We have extensive documentation with a lot of examples that anyone who has some JSON documents at hand can try.

- Šimon, what's next for you?

- Yeah, I want to publish more in the CyberSec space and to promote our work. And obviously, finish the PhD, and then we'll see.

- Very cool. And is there anywhere listeners can follow you online?

- I have a LinkedIn profile, and I believe I'm also on Google Scholar. Other than that, I don't use social networks that much.

- Google Scholar is always a good spot. Thank you so much for taking the time to come on and share your work.

- Yeah, thanks for the invitation. It was a really nice talk.

(upbeat music)