Archive FM

Data Skeptic

Graphs for HPC and LLMs

Duration:
52m
Broadcast on:
29 Oct 2024
Audio Format:
other

We are joined by Maciej Besta, a senior researcher on sparse graph computations and large language models at the Scalable Parallel Computing Lab (SPCL). In this episode, we explore the intersection of graph theory, high-performance computing (HPC), Graph Neural Networks (GNNs), and LLMs.

(upbeat music) - You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere. - Well, the paper we're gonna feature in today's interview is called "Demystifying Chains, Trees, and Graphs of Thoughts." We got the graph in there, that's for sure. Did you have familiarity with this idea of chains and maybe graphs of thought before listening to the interview and perhaps reading the paper? - Not so much, because, well, LLMs are new, okay? I use ChatGPT a lot, but not enough to delve into the mechanisms behind it. - I think there's a really interesting opportunity here, like a gold rush kind of opportunity. Obviously LLMs are a big step forward. They're easy to use via API: you don't have to build your own, and for a penny per use or less than that, you can use an API to kind of chain them together in some way. So it's not that the LLM spits out a brilliant response right away, but you say do step one, now take that output, loop it into step two, and kind of orchestrate repeated calls to an LLM. It seems to be a very good opportunity right now, where someone without even a lot of technical skills could build some sort of mechanism based on an LLM that spits out something at the end. What's interesting is that Maciej suggests an even better process that he calls the graph of thoughts for LLMs, so look out for graph of thoughts in his paper. He also uses an interesting related concept, which is the hypergraph of thought. Although hypergraphs are not mentioned in this episode, I think, the concept he talks about in the paper (and his paper, of course, is in the show notes) is this: hypergraphs are made of hyperedges, meaning that a single edge can connect more than two nodes, or connect several nodes simultaneously. For example, a one-to-many broadcast is a hyperedge.
In this case, if we dissect the hyperedge into individual edges, we'll miss the special simultaneous connection that this group has. So it'd be an interesting path, as Maciej suggests, to tap into the potential of hypergraphs of thought to improve LLMs. - Yeah, I think there's a big opportunity here around how you orchestrate them, and the graph is the most robust data structure for doing it, for sure. - Let's take it back to graphs. I wanted to talk about graph databases; you also start with that. - We'll get into all that. We start the episode with some discussion: Maciej has extensive experience working with graphs, so we touch on graph databases and a few other details. Did you have any takeaways from that segment? - A point I would like to emphasize in Maciej's work is the use of network laws to make parallel computing on graphs more efficient. In this episode, he mentions the heavy-tailed degree distribution of networks, which is a common feature of network data. It relates to the network science discovery that many features in networks are not normally distributed but are heavy-tailed. For example, the degree distribution: the degree is the number of neighbors or connections a node has. Intuitively, we might think that in networks, some of the nodes have many connections, some have a few connections, but most of them would be close to the mean. But actually, most of the nodes have only a few connections, and only a few so-called hubs have many connections. So understanding this network law helps to efficiently batch the graph for parallel computing. And I guess we'll hear about it in this episode. - Yeah, it's just part of what we touch on, but I think it gives us an insight into topics we're going to see throughout the season: that sometimes doing graphs at scale requires a large amount of compute, or HPC, high-performance computing, as we hear about here. Well, let's jump right into the interview then.
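The heavy-tailed degree distribution just described can be reproduced with a tiny preferential-attachment simulation in plain Python (a hypothetical sketch, not code from the episode): new nodes link to existing nodes in proportion to their current degree, so most nodes end up with one or two connections while a handful of hubs accumulate many.

```python
import random
from statistics import median

def preferential_attachment(n, seed=0):
    """Grow a graph where each new node links to one existing node chosen
    proportionally to its current degree (Barabasi-Albert style, one edge
    per new node). Returns the list of node degrees."""
    rng = random.Random(seed)
    degree = [1, 1]      # start with two connected nodes
    targets = [0, 1]     # each node appears once per incident edge
    for new in range(2, n):
        old = rng.choice(targets)   # degree-proportional pick
        degree.append(1)
        degree[old] += 1
        targets.extend([new, old])
    return degree

degrees = preferential_attachment(10_000)
# The maximum degree dwarfs the median: a few hubs, many leaves.
```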
- I'm Maciej, and I'm a senior researcher from the Scalable Parallel Computing Lab. So I'm a researcher who is primarily focusing on different parts of the high-performance computing landscape. A lot of it went towards so-called graph computing, which is dealing with different types of irregular computations. Irregular means that we are trying to model a selected part of reality with nodes and edges between these nodes. So you can think of it as entities and their relationships. And this paradigm can be applied to a lot of different parts of what we are doing in computer science. And more recently, we've been trying to apply, or actually applying, this paradigm to make more robust, more effective, more efficient generative AI designs. - Well, why do graph problems in general require HPC? Why can't I get them done on my local machine or maybe a little cluster? - I mean, I would say that a lot of problems beyond graphs also require HPC, right? If you think about the modern LLM landscape, or the broader generative AI landscape, this all requires HPC to a large degree. In terms of graphs, they are interesting within HPC, but also beyond it. Within HPC, because many of these problems are very practical, and we have a lot of data that has to be processed, and this data is just too large to fit into a small system that you are using. But also, even if the data is small (and there are also such cases where the data is small), it can still require a lot of compute because it has high computational complexity. And we have algorithms that enable us, for example, to scale this out very well, and then again we need a lot of resources to make it finish in some reasonable time. This is why graphs are interesting within the HPC domain, but they are also interesting beyond it, because we have a lot of research going on there as well.
And for example, if you look at computer science, where we are studying the underlying foundations of compute, graphs are a very good abstraction for modeling many of these fundamental problems, which sheds some light on different parts of the computational landscape. - So in that environment, are you working on some of the classic ideas that a computer science undergraduate would know, like all-pairs shortest paths? Or are there algorithms that would be a little foreign to such a listener? - I've done a lot of different things, both in the, let's call them, sophisticated things, but also in the very basic ones. For the basic ones, you can hardly go more basic than, let's say, graph traversal, like breadth-first search; there are not too many things that are more basic. And this is, for example, what we focused on over a couple of years. So there were problems like that. Then on the other hand, there were also problems at the other extreme, which is basically graph databases. These are very complex workloads that go way beyond the simplicity of, let's say, basic graph algorithms. And they also touch on different parts, like system design, high-performance system design, and especially data management. So these are, let's say, the other extreme. And then there are also a lot of different things, let's say, in the middle of the complexity of the computational landscape. These are, for example, graph neural networks. When you think about it, they are quite simple in terms of what they are doing as algorithms, but they touch upon a couple of different things. You have to train them, you have to somehow run the inference, you have different types of optimizations, you have different types of compute patterns. So they are, let's say, more complex than a simple BFS.
But at the same time, at least in some aspects (because not in all of them), you could think of them as, okay, they don't have, let's say, transactions or stuff like that, which means they are maybe a bit less complex than graph databases, at least in terms of the system design. But then they come with things like training, inference, you know, batching. So they also come with a lot of different, interesting concepts to be handled. - So there are other areas in computer science that benefit from what some people call being embarrassingly parallel, and then you can just, you know, throw it up in the cloud on 10,000 machines and solve your problem in parallel. Do such luxuries exist in graph problems? - Oh yes, they do. But it also, I would say, depends on your data. The graphs that we are dealing with in many of the industrial data sets have the so-called heavy-tailed degree distribution. That means that some entities that we are processing, so for example some people in a social network, or maybe some hubs, some airport hubs, or some other entities, have many connections to other entities. But then you also have many entities that have very few connections to other entities. You have people that have very few relationships, or maybe in, let's say, molecular databases, you'll have molecules that have not too many connections, or some proteins that do not interact with too many other proteins. And when you think about it, this brings us some potential for applying massive parallelism, but in different ways, right? So you could think, okay, I can now grab many of these low-connectivity vertices and process them in parallel: because they have few connections, I can grab them all in a batch and apply some compute in parallel.
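As a minimal sketch of this idea (hypothetical names, not from any real framework), one way to realize it is to partition vertices by a degree threshold: low-degree vertices get batched and processed as a group, while for each hub the parallelism is applied across its many neighbors instead.

```python
def partition_by_degree(adj, threshold=8):
    """Split vertices into low-degree ones (to be batched together)
    and hubs (whose neighbor lists are processed in parallel)."""
    low = [v for v, nbrs in adj.items() if len(nbrs) < threshold]
    hubs = [v for v, nbrs in adj.items() if len(nbrs) >= threshold]
    return low, hubs

# A toy star graph: one airport-like hub connected to 20 small vertices.
adj = {"hub": list(range(20))}
adj.update({i: ["hub"] for i in range(20)})

low, hubs = partition_by_degree(adj)
```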
But at the same time, you could think, okay, now I can grab one of these very high-degree vertices and apply all of the compute to its neighbors in parallel, right? Now, if you have, for example, breadth-first search, this will enable you to visit your neighbors very fast. Or if you have a graph neural network and the vertex has many neighbors, and you are, let's say, doing some more complex variant of your graph neural network, like, say, a message-passing neural network, you have to apply some compute on each of your edges, right? So that means you can now grab, let's say, these 10,000 edges and do this compute in parallel for all of them. - You know, a graph data structure is a very abstract idea.
You could make a lot of contributions as just sort of a mathematician or computer scientist, but you could also be applying that to a specific domain or specific data set. Are you close to any particular area in that way? - This is a very good question, and I would say it depends, and I would also disagree with some of the foundational assumptions you might have. This is because, yes, in many cases you can just abstract your, let's say, real-world data set into a pure graph structure, and then you have your plain structure, which is vertices and edges, and then, yes, you can apply this paradigm to different domains. And we applied, I mean, we tested our schemes on different data sets ranging from biological networks, via social networks, via publication networks and citation graphs, even to ecological or economic networks, a lot of different things. But then when you actually dive a bit deeper, you will realize that there is a lot of additional data. And actually, by the way, graph databases have nice models to deal with that. They have, for example, the so-called labeled property graph, where you can attach arbitrary, additional rich data, modeled through key-value pairs or through labels, to both your vertices and your edges. You can take your vertex, and you can say, look, I apply some categorization to this vertex. For example, if you have a citation network, you could think of having categories of papers, like technical reports, workshop proceedings, conference proceedings, and other things like white papers. This could be one way of labeling your vertices. But then you can also attach, let's say, key-value pairs in the same way, so you could now say, okay, this vertex will have some abstract, right, because a paper has an abstract, or maybe it will have some equations that you could also just additionally attach to this vertex, right?
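The labeled property graph model just described is easy to picture as plain data. A hypothetical citation-network vertex and edge might look like this (field names are illustrative, not any particular database's schema):

```python
# A vertex: labels categorize it, key-value properties carry rich data.
paper = {
    "labels": {"Paper", "ConferenceProceedings"},
    "properties": {
        "title": "A Study of Graphs",     # placeholder values
        "abstract": "We study graphs.",
        "year": 2021,
    },
}

# An edge with its own label and properties, plus its endpoints.
cites = {
    "labels": {"CITES"},
    "properties": {"context": "related work"},
    "src": "paper_1",
    "dst": "paper_2",
}
```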
And then you have a lot of these rich data sets that you can model through this perspective of both vertices and edges, but also through this rich additional data. It gets a bit more tricky when you think about, okay, now how do I model this? Now I need to also consider this data, so actually it's not really that abstract. But then there's actually an interesting way to make it, in some sense, abstract again. So what we did, for example, in one of our papers, called neural graph databases, was basically try to merge two worlds: on one hand, graph neural networks, which are usually dealing with these more abstract graph structures, and on the other hand, graph databases, where we have these very rich data sets, very rich in labels and properties and, you know, additional data that you need to model in your graph database. And what we do there is propose an encoder called LPG2vec, which basically takes all this rich data, you know, these labels, these properties of arbitrary type, and encodes them into embeddings, which we can then use, basically seamlessly, with different types of GNN, graph neural network, models, right? So in that way, you can again take this rich graph data set and make it more abstract, but this time you have, you know, these vertices, these edges, and your embeddings for the vertices and the edges. But coming back to the question about domains, because I think that was a very extensive answer.
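A drastically simplified stand-in for what an LPG2vec-style encoder does: one-hot encode a vertex's labels over a fixed vocabulary, so the rich data becomes a feature vector a GNN can consume alongside the topology (the real encoder also handles arbitrary properties; this sketch, with hypothetical names, covers labels only).

```python
def encode_labels(labels, vocabulary):
    """One-hot encode a vertex's label set over a fixed label vocabulary,
    yielding a numeric feature vector for a downstream GNN."""
    return [1.0 if name in labels else 0.0 for name in vocabulary]

vocab = ["Paper", "TechnicalReport", "ConferenceProceedings", "WhitePaper"]
vec = encode_labels({"Paper", "ConferenceProceedings"}, vocab)
```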
So I wouldn't say I'm close to one particular use-case domain, because we've been working across a lot of different domains over the years, trying, you know, data sets from biology, from chemistry, from social networks and others. But I would say that most of what we deal with are data sets from biological networks, from social networks, and from citation or general communication networks, like, for example, citation networks. - On the topic of graph databases, when I look into it, I see quite a bit of variety, at least in contrast to relational databases, where pretty much everyone has centered around SQL, the query language, and even, you know, the language is very similar or the same from one database system to another, and the feature sets are pretty similar. Do you have a sense of why there's such variety in graph databases? - That's a very good question, and I'm not sure if I have a good answer to it. I would say there are going to be a lot of different causes. One of the causes is probably the fact that the graph landscape in general is extremely rich in terms of the workloads that we are dealing with, and we are actually still in the process of discovering these workloads. And if you take a look at the general history of graph databases, how they evolved, their designs are sort of following the workloads emerging in graph processing. What you could observe at the very beginning was a focus on one class of workload, which is online transactional processing, right? So most graph databases have been focusing on these kinds of short-term queries, where we insert an edge, or insert an element, or, you know, remove a vertex: these very, very fine-grained queries. This is the bulk of what we have in general databases.
But then what you observe over the years is that more and more designs have been emerging to also focus on analytical, large-scale queries, like, for example, running a large-scale PageRank, or BFS, or something else, right? We kind of discovered that these are workloads you also care about. And then they transitioned into graph databases, where of course it's much more complicated to support these workloads, because then you have to deal not only with how to run these things efficiently on parallel systems, but also with transactions, with ACID guarantees. Now you have, for example, your short-term queries together with your very complicated, large-scale bulk queries like BFS, or PageRank, or something more complex. But then what you observed more recently was another line of research in graph computing called graph mining. We realized that we also care about algorithms where we look for specific patterns in a graph, right? So let's say you want to find all triangles: this is one of the simplest possible patterns you have in a graph. Or maybe subgraphs, or maybe you want to do some clique mining, or, you know, stuff like that. And these workloads, on the one hand, have a lot of parallelism inherently in them, but they are much more complicated than, let's say, PageRank or BFS. On the other hand, they are also sort of in the class of analytics, because they are not really modifying the graph, they are just reading it, but it's again a much more complex class of workloads. And now we again have these things emerging in graph databases. There are, for example, standardizations of these workloads happening these days. And at the same time, you also had graph neural networks emerging, and now you have designs for, you know, neural graph databases.
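Triangle finding, the pattern-matching workload mentioned above, fits in a few lines. A naive sketch: at each vertex, check which pairs of its neighbors are themselves adjacent, then correct for counting each triangle three times.

```python
from itertools import combinations

def count_triangles(adj):
    """Count triangles in an undirected graph given as {vertex: set of
    neighbors}. Each triangle is seen once at each of its three vertices,
    hence the division by 3."""
    seen = 0
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj.get(a, set()):
                seen += 1
    return seen // 3

# One triangle (a-b-c) plus a pendant vertex d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
```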
So we are kind of in the process of discovering these workloads, and they are finding their way into graph databases. And it adds up to the complexity, because now, for example, you have one system that people spent many years implementing and designing, and maybe optimizing for a specific class of workloads. And now it's, okay, we need to add BFS, or GNNs, or, you know, something like that. And now this is a massive code base, a massive system with fixed abstractions, with a fixed way of doing things, and it may be very hard to throw in something new, right? So then, of course, what do you have? You'll have companies emerging: oh, let's have a system that does that, and we just code it from scratch, because maybe we have some, you know, framework ready, and we can add something on top of it to make it a database. This could be one cause of why there are so many different designs out there, and why new ones are coming up. While relational databases converged on SQL, there was no such standardization for graphs until very recently. We have GQL ongoing now, but, yeah, I mean, this is only now. It was basically impossible to have a system adhering to some standard, because there was no such standard. It was more like, okay, these people think this part is important, that other company wants to grab some other part of the landscape, and so on. And they're all important in a way, right? So that's why they all came to be, and they all had chances to succeed, basically. - So there are very many interesting questions one could look into related to graphs without necessarily involving large language models. Although these are a hot topic, they're not obviously connected. At what point did you start to see the possibility of connections? - I mean, when you do a lot of graph science and graph computing, you learn to see these patterns almost everywhere.
And it's not that you are forcing yourself to see it everywhere. It's because, I would say, the world is inherently connected. If you think about your life, you will probably think about it in many ways in terms of entities and their relationships, right? You have family members; relationships are kind of the basis of how we function, how we think, how we model the world in the first place. So basically anywhere, we can throw in the graph way of thinking. When it comes to LLMs, it became pretty much obvious from day zero, when I started to read about prompting and about reasoning in LLMs more generally, that you could apply the graph way of thinking to many parts of this landscape. The first design that we did there was a design called Graph of Thoughts. This was kind of the obvious thing to do. When I saw the chain-of-thought and the tree-of-thought designs, I thought, okay, the obvious next step is to do the graph of thoughts. Because in a chain-of-thought design, where you are stacking the LLM reasoning steps in a chain to give this tool a chance to improve on its own output, you are forming this enhancement style of transformation: you have some entity that you can iteratively improve over a series of steps. Then tree-of-thought was about, okay, I want to explore a couple of different possibilities at every possible step, right? You also have something in between, called self-consistency with chain-of-thought, which was about exploring different chains of reasoning. And then tree-of-thought generalized this to exploring different variants of reasoning at every possible step. But then there is the thing: okay, you could now also merge these outcomes into more powerful outcomes, right? And this is basically what drove the design of Graph of Thoughts.
We want to apply different transformations to LLM outputs, to LLM thoughts, in a way that lets you not only enhance them, not only explore them, but also merge them into more complex, more powerful outcomes, right? And then, of course, you could think about this from other perspectives. You could apply graphs at the fine-tuning stage, because you may want to teach the LLM how to reason about certain relations. You could apply graphs at the RAG stage, because now you may want to think, okay, how do I make retrieval better by teaching it that there are some inherent relationships in the data set, right, that I can use to fetch more relevant data and reduce hallucinations? You could apply this paradigm to many, many other parts of the LLM landscape as well. - Well, could we maybe go back to chain-of-thought and walk through a simple example for listeners who aren't deeply familiar with it? What might be a practical chain-of-thought problem that an LLM could help with? - As a human, right, if you are given some examples to solve, even mathematical tasks: say you have a simple logical or mathematical task about, okay, I have X apples and my brother has Y apples, and now, you know, how many apples do we have together? Or maybe my wife has some other apples, and you want to do a couple of arithmetic operations. It's much easier if you see a sequence of how this is being solved, right? You can see, okay, I apply step one, I apply step two, I apply step three. This is how working through examples works. This is the foundational paradigm for solving basically any type of problem. It turns out that if you apply this to LLMs, they will also catch up on these examples, right?
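A few-shot chain-of-thought prompt of the apples sort described above might be assembled like this (the wording and the helper are hypothetical, just to show the shape: each in-context example spells out its intermediate steps, and the prompt ends where the model is expected to continue step by step):

```python
# Each example pairs a question with worked-out reasoning steps,
# not just the final answer.
examples = [
    ("I have 3 apples and my brother has 2. How many do we have?",
     "I have 3. My brother adds 2. 3 + 2 = 5. Answer: 5."),
    ("I have 5 apples and give away 2. How many remain?",
     "I start with 5. Giving away 2 leaves 5 - 2 = 3. Answer: 3."),
]

def build_cot_prompt(examples, question):
    """Assemble a few-shot chain-of-thought prompt: worked examples
    first, then the new question for the model to continue."""
    parts = [f"Q: {q}\nA: {steps}" for q, steps in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_cot_prompt(examples,
                          "I have 4 apples and buy 3 more. How many?")
```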
So if, in your in-context examples, so in the part of the prompt where we are specifying the examples, you take a couple of these, let's say, puzzles, or, I don't know, maybe finding shortest paths on the graph of my subway connections, and you actually give it these smaller steps, it will be able to mimic them, because it has learned about doing that during the training process, and it will mimic these examples. The success rate will be much higher. And now there are, let's say, two (or more) ways of explaining that. One way to explain why this is more powerful is, again, that it was taught through this kind of working through examples in the training data set. The other way to explain this, or let's say the conjecture that explains why this is so much more powerful, is that it just gives the LLM more FLOPs to operate on and to provide a better quality answer. If you just ask it, okay, what is the answer to question X, and it just gives you the answer, then it will spit out a bunch of tokens, so it will do some amount of compute. But if you now force it to go through these steps, it will actually have more FLOPs, more compute, to be spent on this problem, right? Because it will emit more tokens, it will have more chances to do something useful, because it has more resources to use. This is basically chain-of-thought, and why people think it's as successful as it is. - It makes sense then why we would look at a tree, 'cause there are different paths we could consider, and if we're gonna look at trees, we should look at graphs, of course. Could you elaborate more on what a vertex and an edge are in this context? - When you read the titles, chain of thought or tree of thought or graph of thought, you kind of assume these are very similar things, right?
But when you really go into the details, it's actually a bit more tricky. In the original chain-of-thought paper, it was all about steps of reasoning within a single prompt, right? You give this thing your prompt with the in-context examples, which means you are already giving it a couple of chains of thought as examples. So the prompt you are sending to the LLM already contains two chains, or k chains of thought, where k is the number of your examples, right? And each chain has some number of thoughts in it, because you have these chains of reasoning. And then the LLM will start to spit out these thoughts as steps of reasoning in its own chain of thought, which it produces as an answer to your question. And that's basically what a thought is in the context of chain of thought, or at least the original chain-of-thought paper: these are the partial steps of reasoning that the LLM is giving, right? Now, if you move, for example, to tree of thought, you will realize that at this point there are different tree-of-thought designs already, lots of follow-ups. And this is why we have one more paper where we try to categorize all these different things, right? This is also why I took a deeper look at it, because we wanted to make this whole naming more precise. And we realized that in some tree-of-thought papers, a thought is basically, you know, like a prompt, a whole message we are sending to or receiving from the LLM, but in other works it's actually a very small partial step of reasoning in one potential chain, or path, of this tree. So this is what we try to make more precise in this paper called, simply, Demystifying Chains, Trees, and Graphs of Thoughts:
What exactly a thought is, and what exactly the connection between these thoughts is. And the answer is that it really depends on your design: in some of these works it's a small step of reasoning, and this step can be as large as a whole prompt, right? In graph-of-thought designs, you can see something similar, a similar way to diversify these designs. For example, for us, a thought was the outcome from the LLM, but this outcome could again be of different granularity. It could be a single step of reasoning within a single prompt, but it could also be a whole message from the LLM coming to you, right? So in Graph of Thoughts we provide a way to parameterize this and to control it as a user, and this depends on your problem and on how you want to specify it. But in general, one way to define it nicely is to say that a thought is the execution of the LLM with some input and some output. This is one way to cover all these different cases, right? You could think, okay, if one whole thought is the whole answer to your question from the LLM, then again you have some sort of execution of your underlying engine with some input and output. If it's a small step of reasoning, you could also say that, okay, this is an execution, a smaller execution, but again it does have some input, which the LLM is grabbing internally, and you can see the output, which is basically the next step of its reasoning. - And then what do the edges represent? - Oh, these are the dependencies between these thoughts, right, between these executions. This is, I would say, a bit simpler to define. In a chain of thought, the edge is clearly the dependency on the previous step of reasoning, right? In a tree of thought, same thing, right?
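These dependencies form a directed acyclic graph of thought executions. A minimal sketch using Python's standard `graphlib` (the thought names are hypothetical): each entry maps a thought to the set of thoughts it depends on, and a topological order gives a valid execution schedule.

```python
from graphlib import TopologicalSorter

# For each thought, the set of thoughts it depends on. In a chain, every
# thought has one predecessor; in a graph, a thought like "merge" can
# depend on several predecessors at once.
deps = {
    "draft_a": set(),
    "draft_b": set(),
    "refine_a": {"draft_a"},
    "merge": {"refine_a", "draft_b"},
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(deps).static_order())
```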
And in a graph of thought, again, these are dependencies, but there you could have incoming edges, general edges, coming to a single vertex from different types of predecessors, right? So that means you will have different dependencies for a single vertex, for a single thought. - So, you know, we want to sort of leave the definition of a vertex a little open: maybe it's the output of the LLM, maybe it's the embedding, or something like that. - And now we are getting into, okay, please finish the question, then I will. - Well, I'm curious then: how does that get incorporated? How do the incoming edges get utilized by the downstream vertices? - This again depends on your context. I mean, context used in the meaning of what you are doing, not in the meaning of the context of the LLM. For example, in some designs already out there, you can see that a vertex is treated more like the execution of something that could be an LLM, but could also be something else, right? You have designs like GPTSwarm, where, you know, vertices are more broadly defined, rather than just as purely an LLM inference step. So if you are speaking about this broader definition, where you have, let's say, some framework that is using an underlying LLM, or maybe a team of LLMs, for doing something useful, then what gets fed to a downstream vertex is controlled by some ecosystem on top of the LLM, right? You'll have some control system that is seeing the state of these vertices, seeing how they are executing and what their status is. Some of these vertices, or most of them, will maybe be LLM executions, like inference, but then maybe others will model something else, and then you have these dependencies explicitly controlled. But then when it comes to chain of thought, right?
I mean, this is done by the underlying inference engine, right? You spit out the next token, you grab this token, and then you feed it back, and you do it again and again and again, and this is basically how it propagates, right? So there the LLM itself is kind of the thing that is doing it. - So as I'm sure you know, all trees are graphs, they're a subset; not all graphs are trees, though. What advantages can you achieve by moving to a graph structure? - This is what I already briefly mentioned before: you can combine outcomes from different intermediate steps of computation into more powerful outcomes. What we did in graph of thought was to apply this paradigm to very simple foundational problems like sorting, or intersecting sets, because we wanted to just see, okay, if we could apply this way of reasoning to simple problems that are very foundational, but also to more complex problems, like synthesizing or merging documents such as NDAs. In all of these problems, we were able to encode the problem through this paradigm of, okay, we want to dismantle this problem into subproblems, we solve these subproblems, and then we merge the sub-solutions into outcomes to finally solve the very final problem that we had at the beginning, right? And in that paradigm, this aggregation operation is something that you want to use, right? So you want to take the intermediate steps of computation, the smaller steps, and then you want to combine them to achieve the larger goals, or the more robust goals, or the more powerful goals, right? This combining of different outcomes into more powerful outcomes, that's something you cannot have with trees or with chains. - So if I want to take maybe a use case of planning a trip, which involves flights and accommodation, and certainly could be many steps in an LLM's process, how does the graph structure get developed? How do we know where to begin, I guess?
- So that would depend on what part of the problem we are modeling, because we could think of it, okay, I will start from a very high level: I want to plan a trip. Planning a trip requires doing some things that are very heterogeneous. I need to book hotels, but I also need to book airplane tickets, and I need to synchronize different parts of the trip. These are heterogeneous tasks in the sense that they are different when you think about them, right? Booking a hotel might be similar to booking airplane tickets, but you still need to use different websites, or maybe the same one if you're doing this collectively. But then, for example, synchronizing these things together, that's a very different type of task. When I would be doing this with LLMs, I would be thinking, okay, how do I model this whole problem from, let's say, the top down? What are the big parts of the problem? And then I would think, okay, do I want to apply the graph to all of them at the same time? Or do I maybe want to treat them separately at the high level, because that makes more sense, and then apply the graph to some smaller parts, right? So let's start from the last part. The smaller part could be, let's say that your trip is complicated, and it involves a lot of stages, but of means of transport that you could manage with the same tool, right? Or the same, let's say, data structure for the scheduling. And you may have some constraints, like, I want to visit these 20 different locations, and you want to make your trip efficient, right?
So now what you may want to do is, okay, I want to try different variants of this trip. Maybe I want to visit the Taj Mahal first, or maybe I want to go to Paris first, and I want to try two different variants, because maybe on that day one particular place is booked out, so I want to try different variants to see if I can make the trip more efficient and I don't have to be stalled somewhere. So now you have different variants to be tried. You could, of course, parallelize this. These could be, for example, different chains. But then, within a single one of these chains, you could say, okay, the chain can be long, and processing the whole chain of these attractions may be too much for the LLM, right? So now we may want to split it into sub-problems, and then solve these sub-problems separately. But then you need to merge them into a final outcome, because you still want to have the final schedule, not just a small part of a schedule, right? So then you are going to grab these smaller chunks and put them together into the final schedule for your trip, right? And that could be one chain that you are trying, right? But you may have multiple of these chains, and then you may want to merge them as well, because you want to see which one of them is best. Or maybe I have two chains that are similar to each other in some way, but both of them are sub-optimal. So again, I want to aggregate them, to see if the LLM can produce one final schedule from the two schedules that these two chains are giving me, and get the best of those two, or of the K of them if you have more.
But this is now applying this graph-of-thought decomposition and aggregation to, let's say, scheduling your trip in terms of, okay, I have places to visit, and I have maybe some time constraints or some place constraints, right? To make it more efficient. Maybe airplane tickets were handled in a separate part of the decomposition of the whole problem. So you could think of different ways of applying this decomposition at different stages, from the topmost to the bottommost, but again, they all boil down to decomposing the problem into sub-problems and then merging them to get the best of all of them. - Well, when I think about tree of thought, I see obviously a simpler data structure, but that has some advantages, like, in terms of a search. If I'm looking for a viable solution, some trees, maybe at a maximum depth, we say, "Oh, this has failed." Then I abandon the branch and I go somewhere else. And I feel like if there's a solution to be found, maybe it'll take me a long time looking for it, but probably I'll get to it. Whereas in a graph, perhaps I could get stuck in a cycle or something like that. How do I, I guess, guarantee that a graph-of-thought solution will halt and give me an answer? - I mean, the simplest way to do it is just to make sure you don't have cycles, like make the graph acyclic, right? You can have a directed graph which is acyclic and it's still more than a tree. You could just enforce this at the very modeling part of the problem solution, right? Okay, I will enforce that there are no cycles. And this could be a very nice way to do it, because you could then model your cycles maybe through more directed edges that are merging, but then you have this kind of power to maybe cut it off in a more effective way. But the other part of my answer would be that maybe some problems are just better with trees. You know, it's not that you need graphs everywhere.
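To make the "just make the graph acyclic" advice concrete, here is a minimal sketch of validating a thought graph with a topological sort (Kahn's algorithm) before executing it. This is an illustrative editor's sketch, not code from the Graph of Thoughts framework; the name `is_dag` and the edge-list representation are assumptions.

```python
from collections import deque

def is_dag(edges, num_thoughts):
    """Check that a thought graph (edges = list of (src, dst) dependency
    pairs) is acyclic, so a graph-of-thoughts schedule is guaranteed to
    terminate rather than loop forever."""
    indegree = [0] * num_thoughts
    succs = [[] for _ in range(num_thoughts)]
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1
    # Kahn's algorithm: repeatedly schedule thoughts with no unmet dependencies.
    ready = deque(i for i in range(num_thoughts) if indegree[i] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in succs[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # If every thought got scheduled, there was no cycle.
    return visited == num_thoughts
```

Note that a "diamond" of edges, two branches merging back into one vertex, passes this check: merging is exactly the aggregation a DAG still allows beyond a tree, while a true cycle is rejected.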
Trees are easier to deal with, and in some problems, like search problems, trees could be better. If you don't have this need to combine different parts of the problem into something bigger, then why not just use trees, right? - Well then, as you point out, in some cases maybe you'd use A versus B. How do you measure effectiveness towards deciding what's the right solution? - I mean, there are a couple of basic metrics, right? So the cost in tokens, like ultimately the dollar cost, is one of them, and you want to make your solution cost less, obviously. You don't want to overspend. And the performance, right? I mean, how fast can you get the answer? For example, in our paper, we show that chain of thought is usually the cheapest solution because it's simple, right? You just have a chain of reasoning. And I mean, how long can you make this chain? Obviously, you could also make it more expensive by stacking chains together, but then the quality of your answer will probably not be too great, because, I mean, how much can you benefit from splitting the problem into a line of thoughts, right? At some point there is a limit to it. You also need to go for breadth, not just for the depth of your reasoning chain. So overall, we find that chain of thought has its boundaries of being effective in terms of the quality of the outcomes. Chain of thought is kind of the cheapest one because it uses fewer tokens, right? But then at the same time, as we show, it's not very good when you really require this kind of going for breadth, where you want to decompose the problem because that makes it more effective and the quality better. So it's a trade-off, right? You need to decide, what do I want to do first, and what is the most important for me at this stage?
In general, when you are, let's say, starting from scratch to solve some problem, the recommended way to do it is to first optimize for quality. You want to get the quality as good as possible, and then you want to optimize for cost. It boils down to saying, let's have a complex graph, because if this is the only way we can solve the problem with top quality, then we go for a complex graph, but then we try to simplify. We try to remove as much as possible from the complex structure of reasoning to make it as simple as possible, while staying within the limit, the level of your accepted quality. - So in the paper, you look at, I think, three use cases: sorting, and then set intersection, and I'm forgetting the third. Within those use cases, how does your methodology compare to the other ones that you explored? - So that's what I mentioned: chain of thought, or even the basic IO prompting, where you just give the question and you get the answer in a single step of reasoning. These two were, by a large margin, or let's say not by that large a margin, but significantly, less expensive, or more cost effective. So the fewest tokens required. At the same time, they had the lowest quality. And that's really because the more we grew the complexity of the problems, in terms of, let's say, the input data size, the more these methods were failing to properly solve these problems. Again, because of the sheer amount of data items you have to deal with in your prompt. And this is what we really focused on. So we took, let's say, a sequence of 32 numbers to be sorted, and we were increasing steadily to 64, 128, and beyond that. And chain of thought and the simpler methods just failed to do it.
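The decompose-solve-aggregate pattern behind the sorting experiments can be sketched in a few lines. This is a hedged illustration, not the actual Graph of Thoughts API: `solve_with_llm` is a deterministic stand-in for what would really be a prompt per sub-list, and the pairwise merging stands in for the aggregation vertices of the thought graph.

```python
def solve_with_llm(task, items):
    """Stand-in for an LLM call. Sorting deterministically here keeps the
    sketch runnable; in the real setup each call would be a prompt like
    'Sort this list: ...' sent to the model."""
    return sorted(items)

def merge_thoughts(left, right):
    """Aggregation step: combine two solved sub-problems into one.
    In graph-of-thoughts terms, a vertex with two incoming edges."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def graph_of_thought_sort(numbers, chunk=8):
    # Decompose: split the input into sub-problems small enough for one prompt.
    chunks = [numbers[k:k + chunk] for k in range(0, len(numbers), chunk)]
    # Solve each sub-problem independently (these calls could run in parallel).
    solved = [solve_with_llm("sort", c) for c in chunks]
    # Aggregate: repeatedly merge pairs until one final answer remains.
    while len(solved) > 1:
        solved = [merge_thoughts(solved[k], solved[k + 1])
                  if k + 1 < len(solved) else solved[k]
                  for k in range(0, len(solved), 2)]
    return solved[0]
```

The point of the shape, rather than this toy implementation, is that no single call ever sees more than `chunk` items, which is exactly what lets the approach keep working as the input grows from 32 to 128 numbers and beyond.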
And then tree of thought, and the intermediate methods like self-consistency, and different variants of tree of thought, because we also varied the structure of the tree, right? For example, for a fixed cost, you could have a very deep tree with fewer branches, or a very broad, very wide tree with, let's say, lower depth, right? So we tried different variants, and tree of thought was on par, kind of comparable, in terms of the cost in tokens. Sometimes it was cheaper; some other variants were a bit more expensive than graph of thought. In terms of the quality, it was falling behind. And this is, again, because for that class of problems, tree of thought couldn't really combine these sub-solutions into final solutions in an effective way, right? A tree is, again, about searching; you want to search for a solution in a search space, right? That's how we use trees, or one of the ways we use trees. And for these problems where you want to split and then compose back, you really need these aggregations to actually make it more effective in terms of the quality of outcomes. And it was nicely visible as we scaled the complexity, because even if the tree was, let's say, reasonable for the less complex problem instances, where you had fewer data points in the input, it was more and more visible that the difference was in our favor, in graph of thought's favor, as you scaled towards bigger data sizes. And by the way, the reason why we picked set intersection is that it's very simple. When you think about set intersection, yeah, it's like some abstract mathematical operation, right? But actually, it underlies a lot of different, very practical operations. A lot of, for example, clustering schemes are really, really dependent on set intersection.
A lot of graph problems, very practical ones, like clique mining, subgraph discovery, triangle counting, can also be abstracted as basically a schedule of set operations of different sizes. That's what we show in some other lines of work that we released and that I led a couple of years ago: one of them is called ProbGraph, another is called SISA, and the third one is called GraphMineSuite. It's basically how to use the underlying set abstraction for more efficient graph computations, where we show that plenty of graph algorithms, I would even risk saying that in some domains the majority of them, can be rewritten to nicely benefit from set intersection. - I spend a fair amount of time interacting with large language models on a daily basis, and I do what I call prompt engineering, which is trying to write the most clever prompt in the way that I think will get me the answer I want. Can you relate the concept of prompt engineering to graph of thought? - I mean, at this point, it's a bit mature, in the sense that we have nice practices, and there are different types of schemes that we know work. Let's maybe start from what we've been discussing so far in this podcast, right? We've been discussing tree of thought and graph of thought, and so one line of doing better prompting is to apply this structuring of the problem, decomposing the problem into some sort of entities and their relationships, right? So again, you have chain of thought, you have tree of thought, you have graph of thought, right? So this is one way to make your prompts better: basically apply chain of thought, decompose your problem into substeps, and make the LLM follow these substeps, right? This is one way of doing it.
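To ground the set-intersection point from a moment ago: triangle counting, one of the graph problems mentioned, really does reduce to intersecting neighbor sets. A plain-Python sketch (not the SISA or GraphMineSuite implementations, which target hardware and approximate set representations):

```python
def count_triangles(edges):
    """Count triangles in an undirected graph by intersecting neighbor sets:
    for each edge (u, v), every common neighbor of u and v closes a triangle."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, v in edges:
        total += len(adj[u] & adj[v])  # the set intersection doing the work
    return total // 3  # each triangle is counted once per each of its 3 edges
```

The whole algorithm is literally a schedule of set intersections of varying sizes, which is why a fast set-intersection primitive speeds up this entire family of graph-mining problems.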
The other class of strategies for better prompting is just to make your prompt more structured, even in the kind of broader sense of how you write your reports or your documents. If you're writing a report or a document, you will use some structure, say sections, subsections, examples. And if you structure your prompt in that way, you also get better results. For example, if you have explicit tags in your prompt, like: here's the problem introduction, here are the examples, here's my question to you, here's the place where you should answer, again, you will get better results, right? So more structure in how you divide your question, your prompt, into subsections or parts. That's the other very basic way of making better prompts, and then there's the whole of prompt engineering. Then we get into more and more complex schemes for prompt engineering, but also more interesting, right? And more powerful. One of them is something called program-aided prompting. In that particular class of schemes, it was shown that what people have been doing with LLMs can also be seen from the perspective of two steps. In step one, you ask the LLM to decompose your problem into some series of steps. And here we kind of don't care if this is a chain of thought or something more complex; it's just that in one step, you may ask the LLM, okay, give me the steps that I should follow to solve the problem. And then the second part is, okay, now solve the problem using these steps. And it was shown that, interestingly, LLMs are actually quite effective at creating this plan of steps for solving your question or problem, but they are in many ways struggling with actually executing what they just described should be done. Following this list of steps is not always as effective as writing down the steps to be taken, right?
And now there is this approach, program-aided prompting, where you actually ask the LLM to provide some pseudocode or some code that is going to solve the problem that you have. Maybe only for a few steps of your whole problem, but at some point you want it to provide some code, and then you run this code using, for example, some sandbox environment or some external tool that is going to actually execute it in a more effective way, right? So this is the other class of schemes, called program-aided. Then there is a whole bunch of schemes about problem decomposition, like planning and decomposition. This is the bulk of the agent works of these days: how do we plan to solve some more complex problem, and then how do we execute solving this problem, right? So it's explicit decomposition of your complex problem, like planning a trip, right, as you mentioned. We need to book the hotels, we need to have connections between the cities where you are going to sleep, we need to synchronize so that all these things happen in the right order in time, so that you are not booking a hotel in the place you are flying from right now, but actually in the city you are flying to for the next night. So this planning, this decomposition into logical, reasonable steps, and then executing, right? So this whole planning is another one, and then you have designs that are half prompt-engineering schemes and half more about agents, about how to actually do it more effectively. Designs like ADaPT, where you are applying recursive decomposition; by the way, very good work, you should take a look at that. Recursive decomposition of your steps into sub-steps: you ask the LLM, here is something to be solved, solve it. If it's able to solve it out of the box, then very good.
But if it can't, then it will split this into a sequence of steps to be solved, and then it will try to solve each of these smaller steps. And again, if it can, very good; if it cannot, for any of them, it will again recursively decompose into some other series of steps, right? So you proceed as deep as you need to solve a given bigger problem. And then one other very interesting one, maybe not the final one, but a very interesting one, is learning-based. Learning understood not as training at the level of weights, but learning understood as refining your prompts over and over and over again by the LLM itself, so that it's learning how to make its own intermediate knowledge, which we are including in its prompt, more abstract and more powerful, able to solve more general problems, right? And this could also be combined with retrieval schemes. Now we are going a bit broader, towards retrieval, but this is also applied in settings without retrieval, at the prompting level. For example, you are feeding it problems, like solving some equations. The simplest thing is solving equations: you throw a bunch of equations at it, and you ask it, okay, look, I want you to solve these, but at the same time, I also want you to give me the most commonly occurring equations or transformations of numbers that you are seeing. It will give you these most frequently occurring transformations, and you save them into some, let's say, part of the prompt that is storing the rules it should learn, right? So then in your prompt, you have a part which is called, like, rules for me to follow, right?
And then you keep asking other questions, and in the process, again, you ask it, okay, give me the answer to my current problem, but at the same time, try to, again, abstract this problem and give me the most commonly occurring patterns or rules or ways of acting on what I'm asking you to act on. Then combine it with what you already have, this set of rules that you have been building so far and seeing in the prompt all the time, and refine it, right? So you have a couple of schemes in this domain. One of the initial ones was the Simulacra work, this kind of agent paper that basically shows how to simulate human behavior using different LLMs. They have this very nice history of interactions of different actors being simulated by LLMs; they have their histories of interactions and their thoughts, and they are continually refining these thoughts into more and more abstract thoughts or insights or memories, which they then show can be used to build more and more complex behavior, right? And then you have more follow-up works. One of the other ones is, I think, called Hypotheses-to-Theories, where basically this is applied to some mathematical tasks where the LLMs are learning, again in a continual way, more and more mathematical rules to apply to more complex problems. - What's next for graph of thought? - We are working actively on a couple of different threads, and more broadly on the synergy between graphs and the broader LLM landscape, the generative AI landscape, right? So, one way of going forward is to instill the graph reasoning at the training stage, right?
So, there are a couple of different approaches for doing that, and we have some cool ideas for how to do it in a very interesting way, in a way that will make this reasoning at the weight level much more powerful than what we have these days. And when I speak of instilling the power of reasoning with graphs at training time, I don't really mean using LLMs for graph problems; I mean using LLMs for more general problems, anything that you want. Instilling this graph-based reasoning is going to apply to arbitrary problems. So, that's one direction of research that we are pushing here. The other one is basically making graph of thoughts more agent-like and much more, let's say, wide and powerful than what it is now, right? The original graph of thought that we designed is basically a way to structure your problems, right? Basically, how do you schedule your sub-problems and combine them into more powerful outcomes? And now you may want to think, okay, how do I combine this with the broader landscape of retrieval, of accessing the web, of running different tools, maybe combining with some psychological effects, you know, with different parts of the general agent landscape? But then, how do you apply this graph paradigm? Because you can do it, and this is what we're working on: you can apply this graph paradigm to make this whole landscape of general agents much more effective in how they solve problems, right? So, this is the other big thread that we are working on. And there are some others, basically, you know, next-stage things, beyond transformers, right? I mean, how do you make more powerful models beyond transformers? That's also what we are looking at. And then there are some other threads that we are also looking at, but I would say that these three are kind of the big ones, or the interesting ones, right? So-- - Makes sense.
- And what about the field of graph theory in general? Are there any emerging or special areas you're excited about, that we'll maybe be hearing interesting things about in the next few years? - So, I'm not following the whole graph theory field per se, but I'm following some smaller threads, and one thing that is very interesting for me is something dealing with graphs in terms of their sparsity, and more specifically graphs which are neither very sparse nor very dense. You could design an algorithm which is very general in what graph input data sets it processes; it works for anything, right? And then you also see a lot of designs that are very specific to, for example, sparse graphs or very dense graphs, right? The Hamiltonian cycle problem is one of these problems, where we have some schemes, for example, for deciding if a graph has the cycle or not based on its density: if the graph is dense enough, it's very simple to decide the graph is Hamiltonian, right? There are different ways to do it, based on, for example, the degrees of the vertices. But then there is this whole regime in the middle, which, when you think about it, comes with the most complexity in terms of how many graph instances we have, right? There is some number of very dense graphs and there is some number of very sparse graphs, but then in the middle, you have many more of these graph instances. And this is kind of an interesting thing, so there are some theory works about that, about proving the properties of these graphs which are neither very dense nor very sparse. And for me, this is very interesting because it kind of feels that this is where the main complexity lies. And, you know, this is usually where you can get the most interesting insights and discoveries.
- And where do you plan to focus your own work and research in the near future? - So I'm usually very consistent with what I do. I mean, as you can see, I'm also changing domains in a radical way, like from graphs to LLMs, but at the same time, I'm very consistent in trying to build upon what I already know, and on my experiences, especially if they are the right ones. Of course, you sometimes have to decide if you want to try something completely from scratch, which might be the better approach, but at this stage, it looks like applying this relationship-based reasoning is going very well, and it's a very powerful paradigm which is very underexplored. So the main threads of research are going to be, okay, we have these ideas that we think are very powerful, and many of them are based on applying the graph paradigm to make more powerful designs in LLMs, post-transformer models, the whole training process, and so on. By the way, one more thread that I forgot to mention is the performance-centric thread, right? The bunch of research threads I mentioned is basically about the concepts, about how to make more powerful models or, let's say, ecosystems of models. But then there's also the inherent aspect of raw performance. I mean, you want to do more in a given time frame for a given amount of resources, and this is another thread of work that we've been working on: basically, how to make, for example, more communication-efficient LLMs, or the training of transformers, and stuff like that, right? So this is the raw performance aspect there. It's less about graphs; it's more about general sparse compute. This is one other thread I've been working on more intensely in, let's say, the last year: basically, general sparse computing, right?
So, we can do dense matrix operations very efficiently at this point, but there is a lot to be done for sparse computing, where you have sparse, irregular, especially very sparse, very irregular matrices, where you don't have very fixed data patterns and non-zero patterns. And this is the other thread, which is very interesting, and that I'm focusing on more and more these days. - And is there anywhere listeners can follow you online? - The things that I'm releasing are visible on the SPCL, the Scalable Parallel Computing Lab, website, and the accounts that we have there. So I would invite people to follow that, and also, basically, my online profile, my Google Scholar or website, because I'm also updating these very frequently. - Very good, we'll have links in the show notes for listeners to follow up. Maciej, thank you so much for taking the time to come on and share your work. - Thanks as well for having me. It was a great pleasure, and hopefully we'll talk again at some point later, when we have more interesting designs that we can discuss together. - Looking forward to it, definitely. (upbeat music)