In this episode, we sit down with Yuanyuan Tian, a principal scientist manager at Microsoft Gray Systems Lab, to discuss the evolving role of graph databases in various industries such as fraud detection in finance and insurance, security, healthcare, and supply chain optimization.
Data Skeptic
Graph Databases and AI
(upbeat music) - You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere. - Welcome to another installment of Data Skeptic Graphs and Networks. I'm joined as always by Asaf, my co-host on this journey. How you doing today, Asaf? - Hi, doing fine. How are you? - Doing great. We've got an interesting episode with Yuanyuan Tian, who works at Microsoft and does a lot of work on deploying graph databases in industry. - Tian takes the bull by the horns, you could say, and points to the challenges faced by the graph database industry. Mainly that the industry use of graphs is a long-tail distribution, as graphs, or networks, are themselves. On the one hand, you've got a few giant companies, like Facebook and LinkedIn, for whom graphs are their business. But precisely because of that, and because of the scale, they build their own databases. They don't need a graph database from the database industry. On the other hand, you have a long tail of smaller companies that just want to incorporate graphs into their business processes. They aren't too married to graphs, but they wisely think that they can benefit from applying graph technology. I think the question is, does the graph database industry do its utmost to help these guys, right? The guys on the long tail. Tian mentions the standardization of graph query languages, like the GQL project, as probably a step in the right direction. But I wonder if it's enough, especially when these days we have LLMs that can code with the right prompting. I do believe this is the future, and if I'm right, I think the industry should do more. >> Yeah, I think it's an exciting time in that regard. Relational databases are so standard and so easy in a certain way, not to undermine the efforts that a database administrator has to put in, but it's easy to get one up and running in an organization.
People understand it, they come in knowing SQL, and it's easy to adopt in business. Graph databases, in my limited view, I see a lot of organizations struggling and even giving up in many cases. For a variety of reasons, some of them justified, some of them organizational, but in each case the team was starting from scratch, really: learning a query language that's new, learning a technology that's new, and I think that has historically presented a barrier to entry. >> Maybe there are ways to make the step a little more doable. Maybe, as I said, with LLMs, which weren't there before. It was very hard then, and GQL is a nice step, but I don't know, it's still a new language, right? >> True, and I'm starting to learn it now. Through this process, I had learned Cypher and a few others in preparation, and now I learn, oh, there's this GQL that we're going to hear about in the interview, of course, which seems to be the path forward, which is good. But interesting, your point about LLMs. I have it on pretty good authority from my colleague David, who people heard in a recent episode, that at least GPT-4 understands GQL pretty well. If you give it your schema and tell it in natural language what you're looking for, it's able to give you the GQL query. So I want to go put that theory to the test a little harder, but that seems kind of promising, right? If maybe in my query box I don't type the query itself, I type what I want, and then I get to inspect the query, that could be the way forward too. >> I'm sorry to learn that you learned Cypher, 'cause I just said GQL was the path. So I'm sorry. (laughs) I'm sorry. I always wanted to, but I said, well, now with LLMs, why bother? >> Fair point, yeah. I found it a little cryptic, but I was trying to decide if that was the curmudgeon in me or just what the language is. >> A friend of mine tried to teach me basic Cypher, and he always said, it's so simple, so simple.
And then every time he showed me, he made a mistake. And I said, oh, I know what the problem is. And then he came back, it's not so easy. I'm imagining a business, a running business, trying to take this giant step. I think it's asking too much. >> Well, I think we can jump right into the interview then. (upbeat music) >> My name is Yuanyuan Tian. I'm a principal scientist manager at Microsoft's Gray Systems Lab. Gray Systems Lab, or GSL, is an applied research lab in Azure Data. So we're sitting at the crossroads of research and product development, applying cutting-edge technologies to improve various data services on Azure. Before GSL, I was a principal research staff member at IBM Research. I want to make sure that what I'm going to say only represents my personal view, not the view of Microsoft. - And can you share any details about some of those cutting-edge technologies you get to work with today? >> So a lot of different things, right? It could be query optimization, workload optimization. In Azure Data, we have a lot of database products. When you have a query, we need to make sure you have the right execution plan, so you get the best performance. So it's about using query optimization techniques and workload optimization techniques to improve these queries or workloads. I also work on ML for systems, so basically applying machine learning to help solve systems problems, be it how to schedule jobs onto clusters, how to allocate resources to a certain data service, how to tune the configuration of a database, or how to optimize queries. For that, we also use ML techniques. And of course, graph databases. I'm working with various graph teams inside of Microsoft and also at LinkedIn on improving their technologies with benchmarking, graph modeling, things like that.
>> Well, I know from reading your paper, The World of Graph Databases from an Industry Perspective, you've got a lot of expertise in this area. When did you get started with graphs? >> So I would say I started from my PhD back in 2003. My thesis was about querying graph databases. I have done a lot of research on different topics in databases, but graph has always had a special place in my heart, because it's the starting point of my research in databases. For example, back at IBM, I started a research project on supporting graph queries on top of DB2. It started as a research project, but then got interest from the product team and was successfully tech-transferred into a product. Right now it's called DB2 Graph. I was basically the technical lead behind the project. Besides that, I've also published on various graph topics: social networks, graph processing, graph databases. >> Well, we've had relational databases, I'm pretty sure, longer than graph databases. Why do we need them? >> I think graph databases provide a very different view of the data. It's true that before graph databases became a thing, a lot of graph data was actually stored as relational data and queried using SQL. But I think graph provides a much more intuitive view for the users, viewing the data as entities and relationships between entities. On top of that, graph databases provide a very intuitive query language for users to explore the data stored as graphs. And there are also graph algorithms: you can run PageRank, you can run connected components. Those types of operations are actually pretty hard to express as relational operators in a relational database. Before graph databases, people usually pulled the data out and ran the algorithms themselves, actually writing code to implement them. But now, with graph databases, these are all much easier to do.
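As a concrete illustration of why such algorithms fit poorly in SQL, here is a minimal PageRank sketch in plain Python over a toy edge list. The graph and parameters are invented for illustration; in a graph database this would typically be a single built-in library call rather than hand-written code.

```python
# A minimal PageRank sketch over a hypothetical directed toy graph,
# illustrating the kind of iterative algorithm that is awkward in SQL.

def pagerank(edges, damping=0.85, iterations=50):
    """Iteratively compute PageRank over a directed edge list."""
    nodes = {n for e in edges for n in e}
    out_degree = {n: 0 for n in nodes}
    incoming = {n: [] for n in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        incoming[dst].append(src)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Each node receives a share of rank from every in-neighbor.
        # (Toy graph: every node here has at least one outgoing edge,
        # so we skip dangling-node handling for brevity.)
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * sum(rank[m] / out_degree[m] for m in incoming[n])
            for n in nodes
        }
    return rank

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
# "c" has two in-links, so it ends up with the highest rank
```

Expressing this fixed-point iteration in SQL would take recursive CTEs and repeated self-joins, which is exactly the cumbersomeness Tian describes.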
>> So we know that at the storage level things are done a little bit differently. What about at the query level? Do I still write SELECT * FROM? >> Well, there are actually different query languages, right? On the query language side, if you're using the RDF model, you have SPARQL. It's also declarative, with a little bit of SQL flavor, but then you have these pattern-matching constructs in the language to help you easily say, I want this type of pattern: this subject has to be connected to that object with this predicate, but that object happens to be connected to another object through this predicate. It's easier to express that way. And if you're using the property graph model, there are a lot of languages. There's Gremlin, which is a little bit imperative. There's Cypher, which is more declarative. Oracle has its own declarative language, more SQL-like. TigerGraph has its own declarative language, also SQL-like. And now you have the standard GQL, which is a declarative language intended to basically be the standard for all property graphs. It has a declarative flavor, so in a sense it looks a little bit like SQL, but it's more than SQL-like: you can express relationship patterns, like star patterns or triangle patterns, much more easily than expressing them as joins in SQL. And there's another effort, a complement to GQL, called SQL/PGQ, which is extending the SQL standard with the MATCH semantics from GQL. So basically it helps you declaratively expose a certain set of data as a property graph view, and then do all this pattern matching on top of that. - So I know back in the relational world, there are subtle differences in the query languages across, let's say, SQL Server, MySQL, Postgres. But most people, your colleague tells you the two tricks and you kind of know it. What sort of differences do you see in graph query languages?
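The join-versus-pattern contrast Tian draws can be sketched as follows. The table, column names, and the GQL-style pattern string are illustrative, not from any particular product; the SQL really runs against an in-memory SQLite database, while the graph-style string is shown only for comparison.

```python
# Finding triangles: three self-joins in SQL vs. one graph pattern.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")])

# Relational form: three self-joins just to say "triangle".
triangles = conn.execute("""
    SELECT e1.src, e2.src, e3.src
    FROM edges e1
    JOIN edges e2 ON e1.dst = e2.src
    JOIN edges e3 ON e2.dst = e3.src
    WHERE e3.dst = e1.src
""").fetchall()
# finds the a -> b -> c -> a triangle (once per starting node)

# The same intent in a GQL/Cypher-style pattern (illustrative syntax):
gql_style = "MATCH (a)-[]->(b)-[]->(c)-[]->(a) RETURN a, b, c"
```

The graph form states the shape directly; the SQL form buries it in join conditions, which is the readability gap the standard languages aim to close.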
Did you know that numerous web scraping issues, like IP blocks, rate limits, or struggles to stay anonymous and secure, can all be solved with just one tool? That's right: proxies. DataImpulse, a proxy provider proclaimed the newcomer of 2024, proves that proxies may be legally obtained, quick to respond, yet have a modest price tag. They operate on a pay-as-you-go pricing model, so you can focus solely on scraping and you don't have to worry that your traffic will expire. Their support team is available 24/7 and you get responses from real humans, no bots in sight. This provider offers IPs from 194 countries and unique residential addresses to scrape the web without getting distracted by geo-restrictions or blocks. DataImpulse doesn't resell proxies from other providers, thus it can guarantee quality while staying friendly towards your wallet. Your reputation is safe with those guys as they have nothing to do with illegal use cases. Go check them out at dataimpulse.com. This episode is brought to you by WorkOS. If you're building a B2B SaaS app, at some point your customers will start asking for enterprise features like single sign-on, SCIM provisioning, role-based access control, and audit trails. That's where WorkOS comes in, with easy-to-use and flexible APIs that help you ship enterprise features on day one without slowing down your core product development. Today, some of the hottest startups in the world are already powered by WorkOS, including ones you probably know, like Perplexity, Vercel, Jasper, and Webflow. WorkOS also provides a generous free tier of up to one million monthly active users for its user management solution, making it the perfect authentication and authorization solution for growing companies. It comes standard with rich features like social logins, bot protection, MFA, roles and permissions, and more. If you're currently looking to build SSO for your first enterprise customer, you should consider using WorkOS.
Integrate in minutes and start shipping enterprise plans today. Check it out at WorkOS.com. That's WorkOS.com. (upbeat music) - Oh, I think before GQL was released this year, it was a chaotic situation, right? There was no standard. Every vendor had its own language, pretty much, though there are certain languages used by more than one vendor: the two are Gremlin and Cypher. And then Oracle has its own, TigerGraph has its own, and others have their own languages. So it's really hard, I would say, for a customer to write their application. But with Gremlin and openCypher being relatively more widely used, I see a lot of customers actually gravitating towards those languages. I think the hope is that with GQL this chaos will end and eventually everybody is going to move on to GQL, but we'll see, right? It was just released. - Well, I certainly like the idea of one query language, and I've heard from many places that that's, you know, the new standard, that's where we're going. But that's a common thing for a standard writer to say about their standard. What's the reality in your mind? - The reality: I do see a lot of momentum behind GQL. With GQL being the standard, a number of companies are actually behind it: Neo4j, TigerGraph, Oracle, and there's LDBC, the organization; they're all behind it. And because they're behind it, they're really embracing it. So now I see, for example, Neo4j is implementing GQL, TigerGraph is, and I think even AWS Neptune is trying to support GQL.
And I just recently saw that Google actually announced their new product called Spanner Graph, and they're going to support just GQL. They're not even supporting Gremlin or openCypher at all; they're going directly to GQL. - That's a heck of a lot of momentum, yeah. - Yeah, yeah, it's definitely momentum. - Could you share some examples of industry applications? When and why do people adopt a graph database? - Yeah, sure. I think it has a lot of applications in different areas. As you mentioned: finance, insurance, healthcare, security, intelligence, supply chains. These are the areas that use graphs a lot. I think the most prominent example would be fraud detection. You hear about a lot of graph usage for fraud detection. In the finance domain, it would be basically fraudsters transferring money to beneficiaries through mule accounts, right? So it's not just one transaction, it's a number of transactions. You're trying to find certain patterns, certain path patterns or graph patterns in your transaction graph that fit the pattern, and those might be potential fraud going on. In healthcare, there's also fraud detection. You look at insurance claims and patient records and try to see whether there's any insurance fraud going on with the policyholders, the patients, and all these service providers. Again, it's a big network and you're trying to find certain subgraphs that fit a certain pattern, and that would be potential fraud. - And how does an organization get into adopting something like that? It seems like an idea a data scientist would want, but there's a long way from the idea of an algorithm to a full deployment of a good production solution. - That is true. I think graph has a little bit of a barrier to entry, right? It's not something that normal people would just pick up. I think there are different potential users, right?
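The mule-account pattern described above can be sketched as a path search over a toy transaction graph. The account names and transfers here are made up, and a real graph database would express this as a declarative path query rather than hand-written search code.

```python
# Toy fraud-detection sketch: find money flowing from a suspect account
# to a beneficiary through intermediate "mule" accounts.
from collections import deque

transfers = {            # hypothetical adjacency list: who sent money to whom
    "fraudster": ["mule1"],
    "mule1": ["mule2"],
    "mule2": ["beneficiary"],
    "victim": ["shop"],
}

def find_paths(graph, start, end, max_hops=4):
    """BFS for all transfer paths from start to end within max_hops."""
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            paths.append(path)
            continue
        if len(path) > max_hops:
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:          # no revisiting accounts
                queue.append(path + [nxt])
    return paths

suspicious = find_paths(transfers, "fraudster", "beneficiary")
# one path: fraudster -> mule1 -> mule2 -> beneficiary
```

In Cypher or GQL the same intent would be roughly a bounded variable-length path pattern between the two accounts, which is why fraud teams gravitate to graph query languages.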
So as you mentioned, it could be data scientists, basically dealing with data on lakes or in databases and doing machine learning; now they want to try graph algorithms as well. For them, I really hope the graph database vendors would see the need and provide some entry point for Python users. We also have a lot of SQL users; they're used to just issuing SQL queries, and now they want to try graphs. For them, I think the need is to really make it easier to transition from SQL to graph. And I see, for example, the SQL/PGQ standard extension is aimed at basically helping this easy transition for the SQL user from SQL to graph land. For adoption of graphs, education is very important for the customers. It's still a relatively niche area, so we really need a lot of education for the users to know how they can use graphs, and if they want to use graphs, how do they start? - And there are several different vendor options out there, each with whatever market share it has. How do most people decide and choose A versus B? What are the criteria that are important? - Yeah, so I think there are a lot of things you need to consider. One is that graph problems are actually greater than just graphs, because it's very rare that your whole workload is just about graphs. It's usually a part of an end-to-end pipeline, right? You definitely need to do some data processing, data ingestion, some pre-processing. Then maybe you do some other type of analysis, and then you do graph, and after graph you need to do another type of analysis. So it boils down to several things. First of all, is your workload just graph, or is it more than graph? Is it a homogeneous graph-only workload, or is it a heterogeneous workload? Second, different graph databases also have different capabilities. You need to consider, for example, what is your latency requirement?
What is your throughput requirement? And how frequent are your updates? Some graph databases are more targeted at batch-oriented updates; others are better at handling a lot of frequent updates. So you need to consider this as well. And then the third one is recency of the result. Quite often, the graph database is not the source of truth. The source of your graph data is actually some operational database, whether it's a relational database or basically a key-value store. You get the data, analyze it a little bit, and ETL it into a graph database, and from there you do your analysis. So you need to think about whether there is a gap, a lag, between your most recent operational data and what is actually in your graph database. With some graph databases it's easier to do the ETL, easier to ingest data from these relational stores; others may require a little bit more time. So that's another criterion you need to consider. - I think some applications, like, I'm thinking of friend recommendation on a social network, it would be okay to do that in batch, maybe even nightly. Let's forget about the cold start problem for now. But other applications, like fraud detection, you probably need something real-time, or very low latency at least. - Exactly. - Does that change the entire design architecture? Do I have to make that choice early on? - I think so, right? Different graph databases have different capabilities. You need to think it through beforehand when you choose the graph database. - And what about scalability? How do graph databases scale? - I think most of today's vendors do provide some sort of scale-out solution. Some are better than others. Actually, currently, a portion of graph usage probably never needs a scale-out solution, at least for now.
So for that, a single-node graph database solution might be sufficient. But I think when customers are shopping for a solution, they're not just looking at their current need; they're also thinking about their potential growth, right? For that, I do see most customers want to at least know that whatever they're choosing has scale-out support eventually. So scale-out is very important. - Are there any hobbyist options? - Hobbyist option? What do you mean, hobbyist option? - So, a developer that wants a weekend project. They've got a little bit of money, but not, you know, VC funding, and they'd like to integrate graph into some project they're working on. - There are definitely open source options out there, like JanusGraph. And even a lot of the commercial graph databases provide free versions: if you have a small problem you want to just try out, they all have the option of a free version, like a community version, for a smaller setup. So as a hobbyist, if you just want to try what it can do, to play with it, I think these are all good options to start with. - When I think again of relational databases, I'm always assuming that there's an IDE of some kind present. You know, I have SQL Server Management Studio or some console. Not necessarily required, but for me it is, I guess. What sort of options exist in the graph database world? - Yes, I think all the major graph vendors, Neo4j, TigerGraph, Oracle, or Neptune, provide some sort of studio, like a graph studio, for people to easily model your data as graphs, easily create queries, and visualize the results. And there are also some tools out there, open source or free versions, that are just for visualizing graphs. You can also try them out, but they're not a studio per se, so you cannot easily model your data in there, and you cannot easily execute your queries in there.
- Are there any theoretical limits that affect a user? I'm thinking of, like, maybe a graph database optimized to perform a PageRank query makes trade-offs where it's not as good for sparse graphs. And there could be some other graph database that's optimized for sparse graphs and isn't as good at PageRank. Do any sort of limits from computer science come into play here? - I mean, there are always ways to improve an algorithm. Like you said, there are sparse graphs and there are dense graphs, and you can design an algorithm for the sparse case and design an algorithm for the dense case. Definitely those play a role on the industry side. If you talk about algorithms, it's all about the libraries of algorithms. Different vendors provide different sets of libraries: each has certain algorithms they support, and they might have different versions of the same algorithm targeting different scenarios. So yeah, these are all provided. - Do you have a sense of any common methodologies? We've mentioned PageRank, or it could be connected components, but what are some of the most popular methods that get people interested and involved in graph databases? - So I think among the most used algorithms, PageRank is one. Along that line, there are other centrality algorithms, to basically measure how central a node is in a graph; these are common things people like to do to understand a graph. There's betweenness centrality, there's degree centrality, and a number of other centrality measurements you can do. Another common thing customers like to know is customer segmentation, or understanding how the nodes are grouped together. For that, connected components or clustering algorithms are very common. And then, let me see, other things would be similarity measurement, to see how similar nodes are.
For example, if you pick one of the nodes, you want to see how similar the other nodes are to this one, or select the other nodes that are similar to this one. That's also a very common sort of operation. So those are on the algorithm side. On the query side, I think the common operation would be traversal: understanding the neighborhood, the one-hop neighborhood, the two-hop neighborhood, up to a certain size, and also trying to find a particular pattern in a graph. So a star shape, a triangle shape, a diamond shape, or basically a circle, you know, a loop shape. These are common operations. - Do you see any common patterns of organizational adoption? How does a graph database get stood up inside of a company? - Besides what I mentioned, the industry areas, for example finance or security or healthcare, these companies you see are mostly the first to adopt graph technology. Supply chain is another one. And within an organization, usually I see more of the R&D sectors being the first to adopt graph, because, as I said, it requires a little bit of skill. It's not something that is easy to adopt. And that's why I'm actually hoping the graph database vendors all try to make the entry point, you know, the barrier to entry, much lower for customers. - And in a lot of those cases, do you find people are improving an existing business process? Or are they unlocking some new opportunity by introducing graph databases? - I think it's both, as I said before. There are certain things you can use relational databases to do; it's just very cumbersome. If you have a particular pattern in mind, you can actually express that graph pattern using joins, but it's very cumbersome. It's very hard to understand, and it's also very hard to write correctly.
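A few of the operations just mentioned, degree centrality, connected components, and a k-hop neighborhood traversal, can be sketched in plain Python on a made-up undirected graph. In a graph database each of these would be a built-in library call or a short query; the code below only illustrates what those calls compute.

```python
# Toy undirected friendship graph (hypothetical node names).
from collections import deque

graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"e"},
    "e": {"d"},
}

# Degree centrality: neighbor count, normalized by the maximum possible.
n = len(graph)
degree_centrality = {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

# Connected components via BFS flood-fill (the "segmentation" use case).
def components(g):
    seen, comps = set(), []
    for start in g:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            v = queue.popleft()
            if v in comp:
                continue
            comp.add(v)
            queue.extend(g[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# k-hop neighborhood: everything reachable within k edges of a node.
def k_hop(g, start, k):
    frontier, visited = {start}, {start}
    for _ in range(k):
        frontier = {w for v in frontier for w in g[v]} - visited
        visited |= frontier
    return visited - {start}
```

Here `components(graph)` yields the triangle {a, b, c} and the pair {d, e}, and `k_hop(graph, "a", 1)` is node a's one-hop neighborhood.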
Adopting graph makes it so much easier: if you know Cypher or GQL, it's basically just using these arrows, and it's very easy to see. Immediately you'll see, oh, that's a certain pattern you can express. Second, adopting graph means running some algorithms, or running graph ML or some other things. Those are things you couldn't do in a traditional database, right? So these will basically unlock new applications, enabling new capabilities that were not there before. - So as you mentioned a bit ago, it's not easy to stand something like this up. You might need an R&D team skilled in getting their hands dirty on new frontiers. So I imagine groups like that at some point have to ask themselves the question, is it gonna be worth it? All these sound like great features, right, that we can use in our algorithms, but if those features have no importance, then the model's not using them. - Right, right. - Do you have any rules of thumb for how to evaluate and alleviate concerns like that? - I actually think we have some hope with the advent of LLMs. There are actually a lot of new applications, or explorations of, for example, enabling natural language to graph queries. This is similar to natural language to SQL; you can also enable natural language to graph queries. I see new work popping up trying to enable this. I think it's going to make the barrier to entry much lower for customers. Other than that, I see there are vendors out there trying to, for example, with SQL/PGQ, let you start to do graph using SQL. And I think there are also vendors trying to make graph easier to access for data scientists by providing libraries in Python. The hope is that it's going to get easier and easier for the developers. - What do you see as the future of graph databases? - So I definitely think we're living in a very interesting time.
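The natural-language-to-graph-query idea above can be sketched as prompt assembly: package the graph schema and a plain-English question, then hand the result to a model. The schema, prompt wording, and workflow here are all hypothetical, and no actual LLM is called in this sketch.

```python
# Hedged sketch of NL-to-graph-query: build the prompt an LLM would see.
# The schema notation below is illustrative, not tied to any vendor.

SCHEMA = "(Account {id, owner})-[:TRANSFERRED {amount, date}]->(Account)"

def build_prompt(question: str) -> str:
    """Assemble a prompt asking a model to translate a question into GQL."""
    return (
        "You write GQL queries.\n"
        f"Graph schema:\n{SCHEMA}\n"
        f"Question: {question}\n"
        "Return only the GQL query."
    )

prompt = build_prompt("Which accounts sent money to account 42?")
# In a real system this prompt would go to an LLM, and the returned query
# would be shown to the user for inspection before being executed.
```

This mirrors the workflow discussed earlier in the episode: the user types what they want, the model proposes the query, and a human reviews it before it runs.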
I think with AI and graphs right now, for example, I see a lot of mention of AI and knowledge graphs together, and I think it's becoming one of the hottest areas in research and in industry in general. I think it's going to make graph much more important, much more widely used, given that it can be very useful in helping LLM-based applications. People are thinking about several applications. You've heard about RAG, which is retrieval-augmented generation, and normally people use vector databases for it. Now there's a lot of work, sometimes called knowledge graph RAG, saying we can actually use a knowledge graph to help do better RAG, because it actually captures the relationships among the different entities. Hopefully it will capture better context for feeding into LLM-based applications. And there are also applications saying we can now generate better knowledge graphs using LLMs: you can use an LLM to extract relationships from text data or other types of data and then generate a knowledge graph to basically augment the existing human-curated knowledge graph. So I think it's very exciting; AI and graph is going to be a very interesting field going forward. - And I imagine with cases where you help a customer do a deployment, most of the time that's confidential, but are there any particular use cases you can highlight that you think might be novel or interesting to listeners? - I definitely see customers asking, how can I use my knowledge graph to help with my question answering together with an LLM? So that has been an interesting one, and not just from one customer; multiple people have been asking about this. - Do they do that for internal knowledge graphs? Or is it something public facing? - Internal, right?
So I think the RAG approach is really helpful for internal knowledge. If it is already public, the LLM probably already knows about it, so you don't really need to provide a lot of this extra context. I think it's mostly about confidential data from your own enterprise, and how do we provide this extra knowledge to augment the LLM to help generate the right answers to questions? - When you look at the variety of solutions that are available for graph databases, and I know you've written pretty extensively about this too, what are some of the major categorical differences you see? How do you classify graph databases? - You know, there's a solution space, right? Different solutions for graph databases. There are basically native graph databases and there are hybrid graph databases; that's one way to look at it. There are also graph-only databases versus converged, or multi-model, databases. - And why wouldn't everyone go multi-model? That just sounds like a graph database, but a little better. - I think it's because, in reality, you sometimes do have just graph-only workloads. But often you have the data, and sometimes you want to view it as tables and query it using relational SQL operations, while other times you want to view the same data as graphs and query it as a graph. Unification is another trend, besides what I said about AI. Unification means you want to have a unified platform where all your data sits, and you can query it in different ways: you can view it using different models and query it in different ways, but together, you have maybe one pipeline that contains different operations, and that can all be done in this unified platform rather than you going through different systems, moving data around, to solve the problem.
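The knowledge-graph RAG idea discussed above can be sketched as triple retrieval: pull the facts about the entities mentioned in a question and hand them to the model as context. The triples, entity names, and the naive word-matching rule below are all invented for illustration; production systems use entity linking and graph traversal instead.

```python
# Minimal knowledge-graph RAG sketch over a toy internal knowledge graph.
# All subjects, predicates, and objects here are hypothetical.

triples = [
    ("AzureDB", "developed_by", "ContosoTeam"),
    ("AzureDB", "depends_on", "StorageX"),
    ("StorageX", "owned_by", "InfraGroup"),
]

def retrieve_context(question: str, kg):
    """Return triples whose subject or object appears in the question."""
    words = set(question.replace("?", "").split())
    return [t for t in kg if t[0] in words or t[2] in words]

context = retrieve_context("Who maintains StorageX?", triples)
# context holds the two StorageX triples; a real system would format
# these as text and prepend them to the LLM prompt as grounding context.
```

Because the retrieved facts include relationships between entities, not just isolated passages, the model gets exactly the kind of connected context Tian argues a knowledge graph provides over a plain vector store.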
- So again, wearing my relational database hat, I think about, like, a transactional database as something that's always live, always on, has to be there. And maybe data warehousing is something that can be passive: the data can be at rest, all the batch workloads run, and then containers can be spun down. Can I apply the same framework to graph databases? - Yes. In fact, there are different graph workloads, as I mentioned. You have the algorithms, you have these queries, which are much more low latency, and you also have graph updates and insert operations. You can think of those like graph OLTP operations, and the graph algorithms you can think of like graph OLAP, because they require a much longer time. So yes, I think it's very analogous to the relational world: you have this sort of OLTP low-latency type of workload, and you also have these analytics, which require a long time and get you something at the end. - Well, what's next for you? - Well, I'm continuing to work with my colleagues from the product teams, helping them generate better graph products for the users and helping the customers get onto these new exciting applications. - Very cool. Well, please give us a tip at the appropriate time when one of those new products is coming out. It's an exciting space. - Yeah, sure, sure. - And is there anywhere listeners can follow you online? - I think the best is actually to follow me on LinkedIn, or just Google my name and you'll find my homepage. I do also have Twitter, but I don't use it as often. So I think LinkedIn would be the best way. - Sounds good, we'll have stuff in the show notes for listeners to follow up. Thank you so much for taking the time to come on and share all your insights. - Thank you. (upbeat music)