Data Skeptic

Network Analysis in Practice

Duration: 29m
Broadcast on: 14 Oct 2024
Audio Format: other

Our new season "Graphs and Networks" begins here! We are joined by new co-host Asaf Shapira, a network analysis consultant and the podcaster of NETfrix – the network science podcast. Kyle and Asaf discuss ideas to cover in the season and explore Asaf's work in the field.

(upbeat music)
- You're listening to Data Skeptic Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere.
- Well, welcome to the very first episode of Data Skeptic Graphs and Networks, our new season on the graph data structure and social networks and all things that go along with it. I will be joined on this journey by my new co-host, Asaf Shapira. Asaf, can you introduce yourself to the listeners?
- So, hi, very honored to be here. I'm Asaf Shapira and I'm a network analysis consultant.
- And what does your work entail?
- First, I don't know if there are many network analysis consultants. Well, that's one of the reasons I started NETfrix, the network science podcast. I am working to lure people to this field. What does a network analysis consultant do? So, recently I'm working with a company called Exposed that takes down inauthentic malicious activity in social media. I am helping them to analyze the social network and find bot farms or other malicious activity through the network structure. That's what makes network analysis unique. We usually don't look at the context or the semantics of the network, but at its structure, and derive interesting things from it.
- Some of my first instincts are like look at inbound and outbound connections. If I have a lot of friends that are mutual friends, that seems more realistic than someone with a thousand friends that are all disconnected. Is it features like that or is it more algorithmic to find these farms?
- Oh, so specifically in the case of the bot farms, I used a similar idea where I used the PageRank algorithm, which I think you talked about a few years ago, right? In this case, we found a giant bot farm with the company. The bots were doing some malicious campaigns on social media. In this case, we found some suspicious users, probably bots. Usually what people do, they extract features.
They look for specific features and try to find some more using machine learning and things like that. But what we did, I think, was much more powerful. That's where the PageRank idea comes in. So just a quick reminder of what PageRank is. You assign each node in the network a basic score. And then it lends part of the score to its neighbors, the connecting neighbors. So in the case, as you said, of the outgoing and incoming directions in a directed network, the more neighbors that point at you, the bigger the score you get. It makes you more important in the network. The simple logic behind it makes it so powerful, right? 'Cause you can't fool the algorithm by creating lots of edges. It only counts the incoming edges. In this case, if you want to get heard on social media, you need either lots of friends pointing at you or some huge hubs that can lend you their score. Following this idea, we followed these bots' followers, the users who pointed at these bots. What we discovered was that there's a giant bot farm whose main purpose was to lend its score to the bots, so they'll get traction. This way, we caught the entire network, the entire bot network, which luckily, it seems, was in its early stages. So we even caught bots that had zero activity. Probably they were either ready for deployment or they were just befriending other bots to increase their PageRank score for the social media algorithm. So it was like tens or hundreds of thousands of bots, in this case.
- So there's a scale of, I guess, the impact of efforts like that. On the very doomsayer end, one might be concerned that social network manipulation could sway an election. On the other end, oh, it's just like a spam email. It's annoying and I just delete it and it goes away. How serious of a problem do you think things are out there in social media?
- First, I should say, I'm not a social media expert. I'm a social network analysis expert.
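The PageRank scoring described earlier in this exchange, where every node starts with a basic score and lends part of it along its outgoing edges, can be sketched in a few lines of plain Python. This is only a minimal illustration and not the analysis used in the bot-farm case: the account names, the toy edge list, the damping factor of 0.85 and the fixed number of iterations are all assumptions made for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed edge list of (follower, followed)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Every node starts with the same basic score.
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Each node keeps a small baseline that does not depend on its links.
        new_score = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its score over everyone.
            targets = out_links[src] or nodes
            share = damping * score[src] / len(targets)
            for dst in targets:
                new_score[dst] += share  # lend part of the score to a neighbor
        score = new_score
    return score


# Hypothetical toy graph: three "farm" accounts all follow one "bot" account,
# while an ordinary user "carol" has a single follower.
edges = [
    ("farm1", "bot"), ("farm2", "bot"), ("farm3", "bot"),
    ("alice", "carol"),
]
scores = pagerank(edges)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy graph the account that the farm points at ends up with the highest score, which is the pattern the transcript describes: following the edges that point at a suspicious account leads back to the accounts lending it their score.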
So I don't care if it's social media or another network. I look at the structure of it, not the content. I don't know how influential bot farms are, but it's something we just don't want to have in the public domain, right? It's malicious content, we don't want it. That's, I think, the main idea. Speaking of social media and elections, you can put it to good use, but I don't like the manipulation through the networks. Okay, that's my moral stand, but I do believe you can use network analysis to better understand your audience, okay, for a politician. I even did some consulting about it for someone running for mayor and helped find what are the key issues that interest the locals, who are the influencers that they can approach, and how the opposition is doing. And all this by understanding the network structure, in this case, the social network structure: who are the influencers, the hubs in the network, how is the network built? What are the clusters in the network? And which subjects are important to the people they represent? In order to understand how to do that, you need to understand how networks are built. What's the universal structure of networks?
- Can you expand on what you mean by universal?
- What made me fall in love with networks in the first place is that networks have universal physical laws that govern them. This is a revelation, right? It was the basis of network science. So network science was started 25 years ago, give or take, by physicists who started researching the phenomenon of giant networks. In the early 21st century, we had the World Wide Web, Facebook, LinkedIn, et cetera. And physicists got interested in what these networks are. What are their characteristics? In networks, there are just a few hubs that have, let's say, thousands or millions of edges. And most of the nodes have just a few edges or none at all.
This distribution is called a long-tail distribution or heavy-tail distribution. You can find this phenomenon in each and every network, real-world networks, I mean, networks we see in reality, right? Not generated networks. It doesn't matter if it's social media, a cyber network or a computer network, whichever. Each network has a long-tail distribution. That's one universal phenomenon we see in networks. Another is communities. That means that in networks, there are dense clusters of nodes that are loosely connected to other clusters. I should point out that in random networks or graphs, there are no clusters. Only in real-world networks can you find them. Community structure is the hallmark of real-world networks. There is no scientific description of a community except that there are more edges within the community than to its neighbors. So it's a very loose definition, but it gives rise to a wide range of community detection algorithms, like the Louvain algorithm, which I think you covered more than 10 years ago, if I remember correctly.
- For listeners who weren't listening back then, can you give us a quick refresher on what the Louvain community detection algorithm does?
- It compares the edges within, let's say, a suspected community to the edges in a generated random network with the same degree distribution. If it finds that the number of edges within that community exceeds the number of edges you might find in the random graph, that means it's a community. Okay, 'cause we can only find communities in real-world networks, not in random graphs. So if it's more than random, it's a community. Okay, so that's one of the loose definitions of a community. The beautiful thing about community structure from an analyst's point of view is that the communities are no coincidence, okay? There's a reason behind them. As we said, it's a universal phenomenon. There is physics involved. Usually the reason is homophily, okay?
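The comparison described above, edges inside a suspected community versus the edges expected in a random network with the same degrees, is usually written as a score called modularity, which Louvain tries to increase by greedily moving nodes between communities. The sketch below computes only that score for a given split and is not the full Louvain procedure. The six-node graph and the two candidate partitions are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph without self-loops."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1

    degree_sum = defaultdict(int)  # total degree of the nodes in each community
    for node, k in degree.items():
        degree_sum[community_of[node]] += k

    # Observed share of internal edges minus the share expected in a random
    # graph that keeps every node's degree.
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Splitting the graph into its two triangles scores above zero, meaning more internal edges than the random baseline predicts, while lumping everything into one group scores zero. That is the "more than random" test from the transcript in numeric form.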
So people or nodes congregate in the network if they are similar or share a similar trait. This is very useful to know because it can help you label the nodes and communities based on this homophilic trait. I can use community detection algorithms to detect different communities in the network I'm analyzing, and then I can label the entire network by different areas of interest, and then I can easily find which area interests me and analyze that specific community. I don't need to go over the whole network to understand what's going on.
- What are some common uses of these communities once you've detected them?
- Part of what I'm doing is ONA, organizational network analysis. I like to use community detection there, mainly because it's not used enough in this field. Lots of organizational researchers or developers use network analysis, or social or organizational network analysis, and usually what they do is look for the influencers in the network. What community detection does is show you how the network is really structured, right? 'Cause when you think about an organization, you think about the organogram, right? There are different departments and so on. And so the intuition of someone thinking about his or her organization is that, well, of course, each department is probably a cluster or a community by itself, and so there's no need for a community detection algorithm. It's sometimes true, but sometimes not, okay? And the interesting parts are where you can find the dissimilarity between the communities in the network and the organogram. So I had an interesting case where I used a community detection algorithm on an organizational network of who knows whom; it was an informal network. In this case, when I applied community detection, I found three clusters. And it was very easy to label these clusters because of the homophilic trait of communities.
One cluster was of the management, upper management. One was the lower-level employees. And another one, the smallest one, was what I called an A-team. What's interesting is that the organization wasn't thinking about itself as having a cluster of management and a cluster of low-level employees. The organogram was made of branches. They had the north branch, central branch, south branch, and so on. It didn't fit, but what the network analysis showed was that people in management talk mostly to people in management. And the lower-level employees talk mainly amongst themselves. And it's a problem for the organization because they want people from management to talk with their low-level employees, giving them directions. And the low-level employees need to feed management with relevant data and so on. And they did talk, but not enough. Not enough to make the communities the same as the organogram. The most interesting part was the third cluster, the smallest one, and I called it the A-team because it worked parallel to the organization. And I knew it was the A-team because they were central players in the network. The most central players or hubs in the organization network were in this cluster. They had a very high in-degree, okay? That means that many people pointed at these employees as their informal friends. If I'm the head of the organization, I can say, well, they work as I wish they would. And I want the entire organization to work like this A-team, okay? So learn from this A-team how to work and spread it to the organization. And without network analysis, they couldn't have found it out, okay? They needed to find it out, but they couldn't 'cause they didn't apply network analysis before.
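The in-degree measure used in the organizational story, how many colleagues point at a given person, is simple to compute from a list of "who named whom" pairs. The sketch below is a hedged illustration: the names and nominations are invented, and a real organizational network analysis would draw on survey or communication data that the transcript does not detail.

```python
from collections import Counter

# Hypothetical survey answers: (person, colleague they named as an informal contact).
nominations = [
    ("ann", "dan"), ("bob", "dan"), ("cat", "dan"),
    ("ann", "eve"), ("bob", "eve"),
    ("dan", "eve"), ("eve", "dan"),
    ("cat", "ann"),
]


def in_degree(edges):
    """Count how many times each person is named by someone else."""
    counts = Counter(dst for _, dst in edges)
    # People who are never named still belong in the table, with a count of 0.
    for src, _ in edges:
        counts.setdefault(src, 0)
    return counts


counts = in_degree(nominations)
ranking = counts.most_common()
for person, named_by in ranking:
    print(person, named_by)
```

The people at the top of such a ranking are the candidates for the hubs Asaf describes. Whether they form their own cluster is a separate question that community detection answers.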
(upbeat music)
- Well, depending on who the narrator is, some people would start telling the story of graph theory with Euler and his seven bridges problem. And in computer science, maybe you start with Alan Turing and Alonzo Church, debatable, maybe it's Babbage. Where does the story of network analysis begin?
- Okay, so first, Euler, the seven bridges of Königsberg, it's a wonderful story. I use it sometimes, but it was like hundreds of years ago. And I think it's symbolic of network analysis in the way that nobody uses it. Okay, it took like 200 years for someone to do something with what Euler did. So, as I said, it's a nice example. For me, network analysis starts with network science and the universal laws. Although I didn't mention SNA, or social network analysis, that comes from social studies. And it's been around for at least 100 years or so, and among the first ones to practice it were Moreno and Helen Jennings. She was his, well, she was his colleague, but didn't get enough credit for her work. And now we're fixing it. Okay, Helen Jennings was his colleague. They drew what they called sociograms of connections between, like, students, pupils in the classroom, and found all the universal laws that I'm talking about. They found them in the 1930s. Okay, 100 years ago, they knew that when you look at social networks, you will find a few hubs in the network and the long tail; they knew about communities. They noticed these phenomena, but people didn't listen to them much 'cause Moreno was a very, I'm sorry, annoying person. Okay, absolutely, I didn't know him personally, but that's what I've heard. And it didn't stick. So it took, I think, about 70 years or so for Barabási and the other physicists to discover the universal laws again. I hope I'm doing justice to the social network analysis researchers and to the network science researchers.
Hopefully one day they'll join together and make a giant community. My hat's off to network science, I think. And it started, as I said, in the late 90s or so with the discovery of the long-tail distribution and community detection and the small world.
- Well, if we draw an analogy to like large language models, why did we get them now? Some of it is GPU and computing scale and cost. Some of it is the data and some of it is we didn't have the transformer algorithm before. Are there any breakthroughs like that that have moved network science forward?
- The bigger the networks, the more discoveries we had. If we didn't have the World Wide Web or giant networks, I don't know if physicists would have been drawn to this idea of researching networks and so on. Good thing they did, but I don't know if data or technology is what challenges network analysis. It's hard to explain to people, not because it's complicated, right? I think networks are very simple. You know, I don't code, okay, I don't do code, but I still analyze networks and I find it very simple to do. The hard thing is to explain it to people. It's like trying to explain to a two-dimensional person what a three-dimensional world looks like. I remember myself, okay? I had a hard time believing that networks are so awesome and it's a magical thing and you have universal laws. I don't know, it sounded weird, right? The long-tail distribution, I don't know. I don't know anyone who has just two friends on Facebook, right? I know people that have hundreds of friends. The reason I don't know the long tail of Facebook, which took me some time to understand, is that they have just two friends, right? So what are the chances that one of them is me, okay? It took me some time to understand, but when I saw it with my own eyes, it was amazing. So I think there's lots of missed potential. I can give, like, two examples.
During COVID-19, I myself worked with the health ministry and tried to convince them to use networks for contact tracing. I think it could have made a huge difference and I failed to do it, okay? And I think the reason for it was that usually when we look at a problem, we look at it as a list, not a network. We usually ask for lists. We ask for the most important things first. But when we look at it through networks, it's very easy to see it in its context. Let's say I have someone that infects lots of people. Is he important, right, or is she important? Are they in the heart of the network, okay? Can they affect others? Or is it just a small place? They infected them and that's it. They won't infect others, okay? The network gives you context. And context is not utilized, usually, and we don't think about it. Another one was with organizational network analysis. I think it's an under-appreciated field because we all live in organizations, right? 'Cause we think we know what organizations are and we say, well, we have an organogram, right? We know organizations, right? We know who's calling the shots. Networks are a way to x-ray organizations, to see how they really work. But people are having quite a hard time thinking that what they see is not what actually happens. It's a very big leap of faith that people need to take, and most of them won't.
- I'd like to put you on the spot to estimate where we are on the maturity scale. Of all of the organizations or researchers or companies or whatever that could benefit from this analysis, how many are embracing it, or about what percentage do you think?
- Well, you really put me on the spot. Of course, not enough. I think it's like, I guess, one digit. I think a single-digit percentage uses network analysis, truly uses network analysis. Like two years ago, there was a convention with Barabási. Barabási was great. I asked him this question, okay, I said, everyone knows what AI is.
Even people that never used AI before, it was like a few years ago. Yeah, not today, machine learning, okay. People knew what machine learning is; even if they didn't know the details, they knew the concept. Why don't people know what network analysis or network science is? It's a branch of science, why don't they know it? And what Barabási said was, you know, people use it, they just don't know they use it, okay? So it was kind of a cop-out, but when we talked, it was during COVID-19, we said people talk about the SIR model, the SIS model, okay, all these models of infection, and they broke out of the network science world, right? 'Cause all these infection models. So what he said was that network science probably doesn't have good PR, okay, but it was reduced to this. So I'm doing my small contribution with NETfrix, okay, I'm trying to, we have like a small community of people that are into network analysis and so on. And I hope it will grow 'cause it's a wonderful field, even if you're not in computer science. That's my reason.
- So tell me a little bit about NETfrix and how people can find it online.
- So NETfrix is on all podcast platforms. Well, the content is how to use network analysis in practice, from industry to everyday use. It has an English version and, funnily enough, a Hebrew version. My last episode, I was talking, well, actually it's my Hebrew episode.
- Well, if there are any bilingual listeners who speak Hebrew, let's definitely get them to it. But I don't know what percentage.
- You learn Cypher, so what's Hebrew for it? You know, what's Hebrew, okay?
- There's no vowels, so that's a stopping point. (both laughing)
- You know what, I have an episode, an English episode that I think people would like. It's about the Dunbar number. Okay, so no spoilers. And my last episode is about the wood wide web, a concept about the network of trees and, spoiler alert, it doesn't really exist. Okay, so I won't say it's a lie.
- I read a whole book about it, so this is new.
- Which, by Simard?
- Per answer. - To some market the author. It was sort of an armchair science book. I actually listened to it on Audible.
- Okay, sure, so if you want me to talk about it, I would love to.
- We'll link it in the show notes. So say no more, I expect listeners to check that out as well. You said something earlier that caught my attention, that you don't consider yourself a coder, don't code much. That being true, what are your tools?
- Actually, I have a few episodes about network analysis software. And what I'm most proud of is that I used MCL. It's like Markovian, something, something, community detection algorithm. I used Excel to do it. I ran an algorithm using Excel and I was very proud of it. But usually most sane people use network analysis software. I use mainly Gephi. It's not for beginners. Okay, for beginners I usually suggest, there are, I don't know, tens or say hundreds, but lots of web tools where you can just upload an Excel file and use some basic algorithms and you'll get a result. When I said I don't code, it's 'cause I think that coding is not the main problem when you analyze networks. The main challenge is how to tell the network story. That's the question. Okay, let's say I use PageRank. Okay, let's say I use Louvain or other algorithms. What do I do with the results? How do I understand it? How do I convey it, okay, to people? I think that's the main challenge.
- Well, maybe we could wind down the episode by brainstorming a little bit on topics we wanna cover this season. There's so many. Anything top of mind for you that you think listeners should be learning about as we get into the weeds on these subjects?
- First, what's your view? 'Cause I understand you have your own way of doing things and I totally respect it. So what are your thoughts and I'll back you up?
- Well, people who listened to the Q&A episode from last year know we've primarily done algorithmic guest finding, which is probably gonna be a bulk of it. So everyone we've recorded already, we found that way. Thanks to arXiv, this amazing resource of pre-print research literature that has an API, so I crawl it and scan it, as you know, and certainly that'll be useful, but I don't want us to feel confined to that because not everyone's published on arXiv and there are stories we'll miss. So it's also nice to come at it. Even when I go to search, it's keyword search for me. I'm doing it with some topics in mind. There's so many things under graphs and networks to analyze; we'll at some point have to pick and choose.
- Neural networks, I think it would be a very interesting subject. We know so little about how our brains work. I think we should ask people how network science can help us find more, to understand more. People say it's the most complex network there is. So maybe we should aim high and cover this subject.
- As you mentioned, sociology is a big source of some of this and I'd love to get into use cases there where not just, okay, somebody did some analysis, but how has a sociologist put these methodologies to work for the betterment of culture and society and these sorts of things? I think there's a lot there we can cover.
- So I think political polarization would be nice, maybe in the social view. There is social networks for good, you know, to help like older people and lonely people. Someone approached me and asked me how he can use social network analysis to help his patients. He had PTSD patients and the most interesting thing was that what hurts them the most is loneliness. Not the PTSD, the loneliness. And he wanted to create a network of patients and he wanted to ask me how he can do it, how to use network analysis in order to do that. Maybe networks in sports, okay?
It's a big network in sports. My initial thought was soccer because I did some soccer analysis, but there's also e-games. I think it was like two years ago there was a very nice paper about e-games, about what makes a good e-game team. What they discovered was that if you add to your team someone who battled against you, ran against you, and you engage him with your team, put him on your team, you get a better team from a network point of view.
- I know, it's kind of nice but I think e-games or sports could be interesting.
- Very neat use case, yeah. I'm interested in disease transmission, as you'd mentioned, fraud as well, another big one that I think we could cover in a couple of episodes, and maybe some of the engineering side of it. It's non-trivial to do some of these operations and how to interact with graph databases or set this stuff up. And lots of cool algorithms like Louvain and PageRank that we've mentioned. So plenty of material to be covering for sure.
- Sounds like a great season.
- Absolutely, I'm certainly looking forward to it and I'm glad to have you here helping me with it. Well, I'll ask you the same question I tend to ask every guest: is there anywhere listeners can follow you online?
- Yes, I have a Twitter account, NETfrix P, and I hope you'll put it in the show.
- Definitely there, or you'll be putting it in show notes. (laughing)
- Oh, and of course the NETfrix podcast.
- Let's include that as well. And we'll work with Vanessa to get something on the homepage too. So in time, that's gotta get cleaned up and spruced up for a new season. Well, I'm very excited to start. Thanks again for taking the time today.
- Thank you.
(upbeat music)